This patch increases the amount of output produced by the
irq_domain_mapping debugfs file by first listing all of the registered
irq domains at the beginning of the output, and then by including all
mapped IRQs in the output, not just the active ones. It is very useful
when debugging irqdomain issues to be able to see the entire list of
mapped irqs, not just the ones that happen to be connected to devices.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
After refactoring the irqdomain code, there are a number of API
functions that are merely empty wrappers around core code. Drop those
wrappers out of the C file and replace them with static inlines in the
header.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
The NOMAP irq_domain type is only used by a handful of interrupt
controllers and it unnecessarily complicates the code by adding special
cases on how to look up mappings and different revmap functions are used
for each type which need to validate the correct type is passed to it
before performing the reverse map. Eliminating the revmap_type and
making a single reverse mapping function simplifies the code. It also
shouldn't be any slower than having separate revmap functions because
the type of the revmap needed to be checked anyway.
The linear and tree revmap types were already merged in a previous
patch. This patch rolls the NOMAP or direct mapping behaviour into the
same domain code making is possible for an irq domain to do any mapping
type; linear, tree or direct; and that the mapping will be transparent
to the interrupt controller driver.
With this change, direct mappings will get stored in the linear or tree
mapping for consistency. Reverse mapping from the hwirq to virq will go
through the normal lookup process. However, any controller using a
direct mapping can take advantage of knowing that hwirq==virq for any
mapped interrupts skip doing a revmap lookup when handling IRQs.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Keeping them separate makes irq_domain more complex and adds a lot of
code (as proven by the diffstat). Merging them simplifies the whole
scheme. This change makes it so both the tree and linear methods can be
used by the same irq_domain instance. If the hwirq is less than the
->linear_size, then the linear map is used to reverse map the hwirq.
Otherwise the radix tree is used. The test for which map to use is no
more expensive that the existing code, so the performance of fast path
is preserved.
It also means that complex interrupt controllers can use both the
linear map and a tree in the same domain. This may be useful for an
interrupt controller with a base set of core irqs and a large number
of GPIOs which might be used as irqs. The linear map could cover the
core irqs, and the tree used for thas irqs. The linear map could
cover the core irqs, and the tree used for the gpios.
v2: Drop reorganization of revmap data
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
This patch adds a name field to the irq_domain structure to help mere
mortals understand the mappings between irq domains and virqs. It also
converts a number of places that have open-coded some kind of fudging
an irqdomain name to use the new field. This means a more consistent
display of names in irq domain log messages and debugfs output.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
The LEGACY mapping unnecessarily complicates the irqdomain code and
can easily be implemented with a linear mapping. By ripping it out
and replacing it with the LINEAR mapping the object size of
irqdomain.c shrinks by about 330 bytes (ARMv7) which offsets the
additional allocation required by the linear map. It also makes it
possible for current LEGACY map users to pre-allocate irq_descs for a
subset of the hwirqs and dynamically allocate the rest as needed.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Commit 98aa468e, "irqdomain: Support for static IRQ mapping and
association" introduced an API for directly associating blocks of hwirqs
to linux irqs. However, if any irq in that block failed to map (say if
the mapping functions returns an error because the irq is already
mapped) then the whole thing will fail and roll back. This is probably
too aggressive since there are valid reasons why a mapping may fail.
ie. Firmware may have a particular IRQ marked as unusable.
This patch drops the error path out of irq_domain_associate(). If a
mapping fails, then it is simply skipped. There is no reason to fail the
entire allocation.
v2: Still output an information message on failed mappings and make sure
attempted mapping gets cleared out of the irq_data structure.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
As we can drop rcu read lock while iterating cgroup hierarchy,
we don't have to do propagation asynchronously via workqueue.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of triggering propagation work in cpuset_attach(), we make
hotplug propagation work wait until there's no task attaching in
progress.
IMO this is more robust. We won't see empty masks in cpuset_attach().
Also it's a preparation for removing propagation work. Without asynchronous
propagation we can't call move_tasks_in_empty_cpuset() in cpuset_attach(),
because otherwise we'll deadlock on cgroup_mutex.
tj: typo fixes.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull timer fixes from Thomas Gleixner:
- Trivial: unused variable removal
- Posix-timers: Add the clock ID to the new proc interface to make it
useful. The interface is new and should be functional when we reach
the final 3.10 release.
- Cure a false positive warning in the tick code introduced by the
overhaul in 3.10
- Fix for a persistent clock detection regression introduced in this
cycle
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Correct run-time detection of persistent_clock.
ntp: Remove unused variable flags in __hardpps
posix-timers: Show clock ID in proc file
tick: Cure broadcast false positive pending bit warning
This branch contains a set of straight forward bug fixes to the
irqdomain code and to a couple of drivers that make use of it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRs51nAAoJEEFnBt12D9kBmnUP/RCgaTn5biRD0tC6OCGwvsZr
YKypc71cZtbO3CTk1Sw2jgDUoW+2FWwtbwKWCmrHIaulRuxoeMHLbpc6fEGFRAjG
ENnSEuSJkYV2T5ZoYjM5mAjotHUcszxZ9uOz7ovCUY72GO/+tfJ97NT9+CCpPfWV
Wa/i08/91UPbWP1ASfMLXVzqO9uqEYvrrvY2PSqJ/g0BkzbybAg38u6IycZkGW4u
/mjglx5fYRhcQgl7o1FDaw97AGjbykt2mgP7EK3R24BxvEy4gmn4IzGo9duOf7Y2
b1tEfro/keRoibuKehPWdKTvpda80DUJjrsOwmNveZHTWlSB8GZXqCEmOmTHngrV
gNX6MUVZClUvKiQCDo3ibyZUmIuUnnlRee6WqQzr2VsMiwct449Gg81zwXX+Yn7O
5KOnlyicJur3f4HqQSKEA2CXU6RRCmk2iqCFMqtutxy20cmm3LoW7OM7rFF7tzix
g6czKZiX+yKwoP2E2EQ2mYM8cirKeEyPhs4EUnKJJOVVZqOCtHkrKnkbSoithsS3
we6Isj8KM8NQ3fgeFsbcxV+ezK3moIzD0fYr3Q6x25VZLYrYH7XpUix0nlGYxCOK
vlEpCaMes/IG/+SKElf8fPoxs0qlOYPvYZBrLjUGCG/VB01bNsj0mjKYm1va+f6v
n3zQbGS7X+TiiHQ+EFL0
=wZCk
-----END PGP SIGNATURE-----
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux
Pull irqdomain bug fixes from Grant Likely:
"This branch contains a set of straight forward bug fixes to the
irqdomain code and to a couple of drivers that make use of it."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux:
irqchip: Return -EPERM for reserved IRQs
irqdomain: document the simple domain first_irq
kernel/irq/irqdomain.c: before use 'irq_data', need check it whether valid.
irqdomain: export irq_domain_add_simple
The first_irq needs to be zero to get a linear domain and that
comes with special semantics. We want to simplify this going
forward but some documentation never hurts.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Since irq_data may be NULL, if so, we WARN_ON(), and continue, 'hwirq'
which related with 'irq_data' has to initialize later, or it will cause
issue.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
All other irq_domain_add_* functions are exported already, and apparently
this one got left out by mistake, which causes build errors for ARM
allmodconfig kernels:
ERROR: "irq_domain_add_simple" [drivers/gpio/gpio-rcar.ko] undefined!
ERROR: "irq_domain_add_simple" [drivers/gpio/gpio-em.ko] undefined!
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
The first two fix the case where full RCU debugging is enabled, enabling
function tracing causes a live lock of the system. This is due to the added
debug checks in rcu_dereference_raw() that is used by the function tracer.
These checks are also traced by the function tracer as well as cause enough
overhead to the function tracer to slow down the system enough that
the time to finish an interrupt can take longer than when the next
interrupt is triggered, causing a live lock from the timer interrupt.
Talking this over with Paul McKenney, we came up with a fix that adds
a new rcu_dereference_raw_notrace() that does not perform these added checks,
and let the function tracer use that.
The third commit fixes a failed compile when branch tracing is enabled,
due to the conversion of the trace_test_buffer() selftest that the
branch trace wasn't converted for.
The forth patch fixes a bug caught by the RCU lockdep code where a
rcu_read_lock() is performed when rcu is disabled (either going to
or from idle, or user space). This happened on the irqsoff tracer
as it calls task_uid(). The fix here was to use current_uid() when
possible that doesn't use rcu locking. Which luckily, is always used
when irqsoff calls this code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRsQZhAAoJEOdOSU1xswtMquIH/0zyrqrLTnkc5MsNnnJ8kH5R
z1cULts4FqBTUNZ1hdb3BTOu4zywjREIkWfM9qqpBmq9Mq6PBxX7gxWTqYvD4jiX
EatiiCKa7Fyddx4iHJNfvtWgKVYt9WKSNeloRugS9h7NxIZ1wpz21DUpENFQzW2f
jWRnq/AKXFmZ0vn1953mPePtRsg61RYpb7DCkTE1gtUnvL43wMd/Mo6p6BLMEG26
1dDK6EWO/uewl8A4oP5JZYP+AP5Ckd4x1PuQK682AtQw+8S6etaGfeJr0WZmKQoD
0aDZ/NXXSNKChlUFGJusBNJCWryONToa+sdiKuk1h/lW/k9Mail/FChiHBzMiwk=
=uvlD
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10-rc3-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This contains 4 fixes.
The first two fix the case where full RCU debugging is enabled,
enabling function tracing causes a live lock of the system. This is
due to the added debug checks in rcu_dereference_raw() that is used by
the function tracer. These checks are also traced by the function
tracer as well as cause enough overhead to the function tracer to slow
down the system enough that the time to finish an interrupt can take
longer than when the next interrupt is triggered, causing a live lock
from the timer interrupt.
Talking this over with Paul McKenney, we came up with a fix that adds
a new rcu_dereference_raw_notrace() that does not perform these added
checks, and let the function tracer use that.
The third commit fixes a failed compile when branch tracing is
enabled, due to the conversion of the trace_test_buffer() selftest
that the branch trace wasn't converted for.
The forth patch fixes a bug caught by the RCU lockdep code where a
rcu_read_lock() is performed when rcu is disabled (either going to or
from idle, or user space). This happened on the irqsoff tracer as it
calls task_uid(). The fix here was to use current_uid() when possible
that doesn't use rcu locking. Which luckily, is always used when
irqsoff calls this code."
* tag 'trace-fixes-v3.10-rc3-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Use current_uid() for critical time tracing
tracing: Fix bad parameter passed in branch selftest
ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()
When param1 is enabled in EINJ but not assigned with a valid
value, sometimes it will cause the error like below:
APEI: Can not request [mem 0x7aaa7000-0x7aaa7007] for APEI EINJ Trigger registers
It is because some firmware will access target address specified in
param1 to trigger the error when injecting memory error. This will
cause resource conflict with regular memory. So It must be removed
from trigger table resources, but incorrect param1/param2
combination will stop this action. Add extra check to avoid
this kind of error.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The irqsoff tracer records the max time that interrupts are disabled.
There are hooks in the assembly code that calls back into the tracer when
interrupts are disabled or enabled.
When they are enabled, the tracer checks if the amount of time they
were disabled is larger than the previous recorded max interrupts off
time. If it is, it creates a snapshot of the currently running trace
to store where the last largest interrupts off time was held and how
it happened.
During testing, this RCU lockdep dump appeared:
[ 1257.829021] ===============================
[ 1257.829021] [ INFO: suspicious RCU usage. ]
[ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G W
[ 1257.829021] -------------------------------
[ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
[ 1257.829021]
[ 1257.829021] other info that might help us debug this:
[ 1257.829021]
[ 1257.829021]
[ 1257.829021] RCU used illegally from idle CPU!
[ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
[ 1257.829021] RCU used illegally from extended quiescent state!
[ 1257.829021] 2 locks held by trace-cmd/4831:
[ 1257.829021] #0: (max_trace_lock){......}, at: [<ffffffff810e2b77>] stop_critical_timing+0x1a3/0x209
[ 1257.829021] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff810dae5a>] __update_max_tr+0x88/0x1ee
[ 1257.829021]
[ 1257.829021] stack backtrace:
[ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G W 3.10.0-rc1-test+ #171
[ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[ 1257.829021] 0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
[ 1257.829021] ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
[ 1257.829021] ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
[ 1257.829021] Call Trace:
[ 1257.829021] [<ffffffff8153dd2b>] dump_stack+0x19/0x1b
[ 1257.829021] [<ffffffff81092a00>] lockdep_rcu_suspicious+0x109/0x112
[ 1257.829021] [<ffffffff810daebf>] __update_max_tr+0xed/0x1ee
[ 1257.829021] [<ffffffff810dae5a>] ? __update_max_tr+0x88/0x1ee
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810dbf85>] update_max_tr_single+0x11d/0x12d
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810e2b15>] stop_critical_timing+0x141/0x209
[ 1257.829021] [<ffffffff8109569a>] ? trace_hardirqs_on+0xd/0xf
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810e3057>] time_hardirqs_on+0x2a/0x2f
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff8109550c>] trace_hardirqs_on_caller+0x16/0x197
[ 1257.829021] [<ffffffff8109569a>] trace_hardirqs_on+0xd/0xf
[ 1257.829021] [<ffffffff811002b9>] user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810029b4>] do_notify_resume+0x92/0x97
[ 1257.829021] [<ffffffff8154bdca>] int_signal+0x12/0x17
What happened was entering into the user code, the interrupts were enabled
and a max interrupts off was recorded. The trace buffer was saved along with
various information about the task: comm, pid, uid, priority, etc.
The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock()
to retrieve the data, and this happened to happen where RCU is blind (user_enter).
As only the preempt and irqs off tracers can have this happen, and they both
only have the tsk == current, if tsk == current, use current_uid() instead of
task_uid(), as current_uid() does not use RCU as only current can change its uid.
This fixes the RCU suspicious splat.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Check if cpus_allowed is to be changed before calling validate_change().
This won't change any behavior, but later it will allow us to do this:
# mkdir /cpuset/child
# echo $$ > /cpuset/child/tasks /* empty cpuset */
# echo > /cpuset/child/cpuset.cpus /* do nothing, won't fail */
Without this patch, the last operation will fail.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- We never pass a NULL @cs to these functions.
- The top cpuset always has some online cpus/mems.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* Rename it from files[] (really?) to cgroup_base_files[].
* Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
used inconsistently. Just use "cgroup." directly.
* Collect insane files at the end. Note that only the insane ones are
missing "cgroup." prefix.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The empty cgroup notification mechanism currently implemented in
cgroup is tragically outdated. Forking and execing userland process
stopped being a viable notification mechanism more than a decade ago.
We're gonna have a saner mechanism. Let's make it clear that this
abomination is going away.
Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Some resources controlled by cgroup aren't per-task and cgroup core
allowing threads of a single thread_group to be in different cgroups
forced memcg do explicitly find the group leader and use it. This is
gonna be nasty when transitioning to unified hierarchy and in general
we don't want and won't support granularity finer than processes.
Mark "tasks" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: cgroups@vger.kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Ever since commit 45f035ab9b ("CONFIG_HOTPLUG should be always on"),
it has been basically impossible to build a kernel with CONFIG_HOTPLUG
turned off. Remove all the remaining references to it.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Print physical address info in a style consistent with the %pR style
used elsewhere in the kernel.
Commit 69f1d475cc did this for a similar printk in this file, but I
must have missed this one.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull cgroup fixes from Tejun Heo:
- Fix for yet another xattr bug which may lead to NULL deref.
- A subtle bug in for_each_descendant_pre(). This bug requires quite
specific conditions to trigger and isn't too likely to actually
happen in the wild, but maybe that just makes it that much more
nastier.
- A warning message added for silly cgroup re-mount (not -o remount,
but unmount followed by mount) behavior.
* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: warn about mismatching options of a new mount of an existing hierarchy
cgroup: fix a subtle bug in descendant pre-order walk
cgroup: initialize xattr before calling d_instantiate()
Since 7300711e ("clockevents: broadcast fixup possible waiters"),
the timekeeping duty is assigned to the CPU that handles the tick
broadcast clock device by the time it is set in one shot mode.
This is an issue in full dynticks mode where the timekeeping duty
must stay handled by the boot CPU for now. Otherwise it prevents
secondary CPUs from offlining and this breaks
suspend/shutdown/reboot/...
As it appears there is no reason for this timekeeping duty to be
moved to the broadcast CPU, besides nothing prevent it from being
later re-assigned to another target, let's simply remove it.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 78becc2709 ("sched: Use an accessor to read the rq clock")
introduces rq_clock(), which obsoletes the use of the "rq" variable
in expire_cfs_rq_runtime() and triggers this build warning:
kernel/sched/fair.c: In function 'expire_cfs_rq_runtime':
kernel/sched/fair.c:2159:13: warning: unused variable 'rq' [-Wunused-variable]
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Turner <pjt@google.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1369904660-14169-1-git-send-email-kamalesh@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In tick_nohz_cpu_down_callback() if the cpu is the one handling
timekeeping, we must return something that stops the CPU_DOWN_PREPARE
notifiers and then start notify CPU_DOWN_FAILED on the already called
notifier call backs.
However traditional errno values are not handled by the notifier unless
these are encapsulated using errno_to_notifier().
Hence the current -EINVAL is misinterpreted and converted to junk after
notifier_to_errno(), leaving the notifier subsystem to random behaviour
such as eventually allowing the cpu to go down.
Fix this by using the standard NOTIFY_BAD instead.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kvm_host.h header file doesn't handle well
inclusion when archs don't support KVM.
This results in build crashes for such archs when they
want to implement context tracking because this subsystem
includes kvm_host.h in order to implement the
guest_enter/exit APIs but it doesn't handle KVM off case.
To fix this, move the guest_enter()/guest_exit()
declarations and generic implementation to the context
tracking headers. These generic APIs actually belong to
this subsystem, besides other domains boundary tracking
like user_enter() et al.
KVM now properly becomes a user of this library, not the
other buggy way around.
Reported-by: Kevin Hilman <khilman@linaro.org>
Reviewed-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While computing the cputime delta of dynticks CPUs,
we are mixing up clocks of differents natures:
* local_clock() which takes care of unstable clock
sources and fix these if needed.
* sched_clock() which is the weaker version of
local_clock(). It doesn't compute any fixup in case
of unstable source.
If the clock source is stable, those two clocks are the
same and we can safely compute the difference against
two random points.
Otherwise it results in random deltas as sched_clock()
can randomly drift away, back or forward, from local_clock().
As a consequence, some strange behaviour with unstable tsc
has been observed such as non progressing constant zero cputime.
(The 'top' command showing no load).
Fix this by only using local_clock(), or its irq safe/remote
equivalent, in vtime code.
Reported-by: Mike Galbraith <efault@gmx.de>
Suggested-by: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Peter Anvin:
- Three EFI-related fixes
- Two early memory initialization fixes
- build fix for older binutils
- fix for an eager FPU performance regression -- currently we don't
allow the use of the FPU at interrupt time *at all* in eager mode,
which is clearly wrong.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Allow FPU to be used at interrupt time even with eagerfpu
x86, crc32-pclmul: Fix build with older binutils
x86-64, init: Fix a possible wraparound bug in switchover in head_64.S
x86, range: fix missing merge during add range
x86, efi: initial the local variable of DataSize to zero
efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse
efivarfs: Never return ENOENT from firmware again
The branch selftest calls trace_test_buffer(), but with the new code
it expects the first parameter to be a pointer to a struct trace_buffer.
All self tests were changed but the branch selftest was missed.
This caused either a crash or failed test when the branch selftest was
enabled.
Link: http://lkml.kernel.org/r/20130529141333.GA24064@localhost
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Below is a patch from android kernel that maintains a histogram of
suspend times. Please review and provide feedback.
Statistices on the time spent in suspend are kept in
/sys/kernel/debug/sleep_time.
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: San Mehat <san@google.com>
Cc: Benoit Goby <benoit@android.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Todd Poynor <toddpoynor@google.com>
[zoran.markovic@linaro.org: Re-formatted suspend time table to better
fit expected values. Moved accounting of suspend time into timekeeping
core. Removed CONFIG_SUSPEND_TIME flag and made the feature conditional
on CONFIG_DEBUG_FS. Changed the file name to sleep_time to better fit
terminology in timekeeping core. Changed seq_printf to seq_puts. Tweaked
commit message]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Add functions needed for hooking up alarmtimer to timerfd:
* alarm_restart: Similar to hrtimer_restart, restart an alarmtimer after
the expires time has already been updated (as with alarm_forward).
* alarm_forward_now: Similar to hrtimer_forward_now, move the expires
time forward to an interval from the current time of the associated clock.
* alarm_start_relative: Start an alarmtimer with an expires time relative to
the current time of the associated clock.
* alarm_expires_remaining: Similar to hrtimer_expires_remaining, return the
amount of time remaining until alarm expiry.
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Some controllers have irqs that aren't wired up and must never be used.
For the generic chip attached to an irq_domain this provides a mask that
can be used to block out particular irqs so that they never get mapped.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1369793454-19197-2-git-send-email-grant.likely@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Provide infrastructure for irq chip implementations which work on
linear irq domains.
- Interface to allocate multiple generic chips which are associated to
the irq domain.
- Interface to get the generic chip pointer for a particular hardware
interrupt in the domain.
- irq domain mapping function to install the chip for a particular
interrupt.
Note: This lacks a removal function for now.
[ Sebastian Hesselbarth: Mask cache and pointer math fixups ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Rob Landley <rob@landley.net>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Link: http://lkml.kernel.org/r/20130506142539.450634298@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Preparatory patch for linear interrupt domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Rob Landley <rob@landley.net>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Link: http://lkml.kernel.org/r/20130506142539.377017672@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some chips have weird bit mask access patterns instead of the linear
you expect. Allow them to calculate the cached mask themself.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Rob Landley <rob@landley.net>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Link: http://lkml.kernel.org/r/20130506142539.302898834@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cache the per irq bit mask instead of recalculating it over and over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Rob Landley <rob@landley.net>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Link: http://lkml.kernel.org/r/20130506142539.227119865@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There are cases where all irq_chip_type instances have separate mask
registers, making a shared mask register cache unsuitable for the
purpose.
Introduce a new flag IRQ_GC_MASK_CACHE_PER_TYPE. If set, point the per
chip mask pointer to the per chip private mask cache instead.
[ tglx: Simplified code, renamed flag and massaged changelog ]
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Joey Oravec <joravec@drewtech.com>
Cc: Lennert Buytenhek <kernel@wantstofly.org>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Holger Brunck <Holger.Brunck@keymile.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Simon Guinot <simon@sequanux.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Rob Landley <rob@landley.net>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Link: http://lkml.kernel.org/r/20130506142539.152569748@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Today the same interrupt mask cache (stored within struct irq_chip_generic)
is shared between all the irq_chip_type instances. As there are instances
where each irq_chip_type uses a distinct mask register (as it is the case
for Orion SoCs), sharing a single mask cache may be incorrect.
So add a distinct pointer for each irq_chip_type, which for now
points to the original mask register within irq_chip_generic.
So no functional changes here.
[ tglx: Minor cosmetic tweaks ]
Reported-by: Joey Oravec <joravec@drewtech.com>
Signed-off-by: Simon Guinot <sguinot@lacie.com>
Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Lennert Buytenhek <kernel@wantstofly.org>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Holger Brunck <Holger.Brunck@keymile.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Simon Guinot <simon@sequanux.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Rob Landley <rob@landley.net>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Link: http://lkml.kernel.org/r/20130506142539.082226607@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since we already have an irq_data_get_chip_type() function which returns
a pointer to irq_chip_type, use that instead of cur_regs().
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Joey Oravec <joravec@drewtech.com>
Cc: Lennert Buytenhek <kernel@wantstofly.org>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Holger Brunck <Holger.Brunck@keymile.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Simon Guinot <simon@sequanux.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jean-Francois Moine <moinejf@free.fr>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Rob Landley <rob@landley.net>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Link: http://lkml.kernel.org/r/20130506142539.010164766@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As rcu_dereference_raw() under RCU debug config options can add quite a
bit of checks, and that tracing uses rcu_dereference_raw(), these checks
happen with the function tracer. The function tracer also happens to trace
these debug checks too. This added overhead can livelock the system.
Have the function tracer use the new RCU _notrace equivalents that do
not do the debug checks for RCU.
Link: http://lkml.kernel.org/r/20130528184209.467603904@goodmis.org
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new __DEVEL__sane_behavior mount option was introduced,
if the root cgroup is alive with no xattr function, to mount a
new cgroup with xattr will be rejected in terms of design which
just fine. However, if the root cgroup does not mounted with
__DEVEL__sane_hehavior, to create a new cgroup with xattr option
will succeed although after that the EA function does not works
as expected but will get ENOTSUPP for setting up attributes under
either cgroup. e.g.
setfattr: /cgroup2/test: Operation not supported
Instead of keeping silence in this case, it's better to drop a log
entry in warning level. That would be helpful to understand the
reason behind the scene from the user's perspective, and this is
essentially an improvement does not break the backward compatibilities.
With this fix, above mount attemption will keep up works as usual but
the following line cound be found at the system log:
[ ...] cgroup: new mount options do not match the existing superblock
tj: minor formatting / message updates.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Since commit 31ade30692, timekeeping_init()
checks for presence of persistent clock by attempting to read a non-zero
time value. This is an issue on platforms where persistent_clock (instead
is implemented as a free-running counter (instead of an RTC) starting
from zero on each boot and running during suspend. Examples are some ARM
platforms (e.g. PandaBoard).
An attempt to read such a clock during timekeeping_init() may return zero
value and falsely declare persistent clock as missing. Additionally, in
the above case suspend times may be accounted twice (once from
timekeeping_resume() and once from rtc_resume()), resulting in a gradual
drift of system time.
This patch does a run-time correction of the issue by doing the same check
during timekeeping_suspend().
A better long-term solution would have to return error when trying to read
non-existing clock and zero when trying to read an uninitialized clock, but
that would require changing all persistent_clock implementations.
This patch addresses the immediate breakage, for now.
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
[jstultz: Tweaked commit message and subject]
Signed-off-by: John Stultz <john.stultz@linaro.org>
kernel/time/ntp.c: In function ‘__hardpps’:
kernel/time/ntp.c:877: warning: unused variable ‘flags’
commit a076b2146f ("ntp: Remove ntp_lock,
using the timekeeping locks to protect ntp state") removed its users,
but not the actual variable.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The first one was reported by Mauro Carvalho Chehab, where if a poll()
is done against a trace buffer for a CPU that has never been online,
it will crash the kernel, as buffers are only created when a CPU comes
on line, but the trace files are for all possible CPUs.
This fix is to check if the buffer was allocated and if not return -EINVAL.
That was the simple fix, the real fix is a bit more complex and not for
a -rc release. We could have the files created when the CPUs come online.
That would require some design changes.
The second one was reported by Peter Zijlstra. If the kernel command line
has ftrace=nop, it will lock up the system on boot up. This is because
the new design for 3.10 has the nop tracer bootstrap the tracing subsystem.
When ftrace=<trace> is defined, when a that tracer is registered, it
starts the tracing, but uses the nop tracer to clear things out.
What happened here was that ftrace=nop caused the registering of nop
to start it and use nop before it was initialized.
The only thing nop needs to have done to initialize it is to have the
tracer point its current_tracer structure member to the nop tracer.
Doing that before registering the nop tracer makes everything work.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRpMkPAAoJEOdOSU1xswtMxeEIALWCnqSCKZJ0Oz+2TuR15vd2
Szm/knRBktRG2FizN8FIouXXMLIYM5HFSvO3Q2bWuV4Dv5KaqNcCEL5BggZC/+Rj
swt5+rMiUuln0teq792h2LhKwORw0YicLzWsyIZ82iSpcFKAseXqcMzEe/P/Emat
+J1QaoeDtOx/3X5Sv6tqHomqR80u7phQJwmIK6Yik389yLo3sy2XiPRk9PJqDpac
V9xbCnZlnopm7rLo7pEAI3R6Vn+MX6lrY1MO0xxjqeIvhvxr9nk0WIRnaevyARbt
eHnCtfa9pjn+bU9xYaFmyIkilc/IEBFRLb0dtEueH81nmaFDXpHI+h/pEFrDJqE=
=PR0j
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more fixes:
The first one was reported by Mauro Carvalho Chehab, where if a poll()
is done against a trace buffer for a CPU that has never been online,
it will crash the kernel, as buffers are only created when a CPU comes
on line, but the trace files are for all possible CPUs.
This fix is to check if the buffer was allocated and if not return
-EINVAL.
That was the simple fix, the real fix is a bit more complex and not
for a -rc release. We could have the files created when the CPUs come
online. That would require some design changes.
The second one was reported by Peter Zijlstra. If the kernel command
line has ftrace=nop, it will lock up the system on boot up. This is
because the new design for 3.10 has the nop tracer bootstrap the
tracing subsystem. When ftrace=<trace> is defined, when a that tracer
is registered, it starts the tracing, but uses the nop tracer to clear
things out. What happened here was that ftrace=nop caused the
registering of nop to start it and use nop before it was initialized.
The only thing nop needs to have done to initialize it is to have the
tracer point its current_tracer structure member to the nop tracer.
Doing that before registering the nop tracer makes everything work."
* tag 'trace-fixes-v3.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Do not poll non allocated cpu buffers
tracing: Fix crash when ftrace=nop on the kernel command line
The tracing infrastructure sets up for possible CPUs, but it uses
the ring buffer polling, it is possible to call the ring buffer
polling code with a CPU that hasn't been allocated. This will cause
a kernel oops when it access a ring buffer cpu buffer that is part
of the possible cpus but hasn't been allocated yet as the CPU has never
been online.
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In old kernels, it's allowed to set softlockup_thresh to -1 or 0
to disable softlockup detection. However watchdog_thresh only
uses 0 to disable detection, and setting it to -1 just froze my
box and nothing I can do but reboot.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/51959668.9040106@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported a problem found by his perf specific trinity
fuzzer.
Al noticed 2 problems with perf's mmap():
- it has issues against fork() since we use vma->vm_mm for accounting.
- it has an rb refcount leak on double mmap().
We fix the issues against fork() by using VM_DONTCOPY; I don't
think there's code out there that uses this; we didn't hear
about weird accounting problems/crashes. If we do need this to
work, the previously proposed VM_PINNED could make this work.
Aside from the rb reference leak spotted by Al, Vince's example
prog was indeed doing a double mmap() through the use of
perf_event_set_output().
This exposes another problem, since we now have 2 events with
one buffer, the accounting gets screwy because we account per
event. Fix this by making the buffer responsible for its own
accounting.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/20130528085548.GA12193@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix to free gone and unused optprobes. This bug will
cause a kernel panic if the user reuses the killed and
unused probe.
Reported at:
http://sourceware.org/ml/systemtap/2013-q2/msg00142.html
In the normal path, an optprobe on an init function is
unregistered when a module goes live.
unregister_kprobe(kp)
-> __unregister_kprobe_top
->__disable_kprobe
->disarm_kprobe(ap == op)
->__disarm_kprobe
->unoptimize_kprobe : the op is queued
on unoptimizing_list
and do nothing in __unregister_kprobe_bottom
After a while (usually wait 5 jiffies), kprobe_optimizer
runs to unoptimize and free optprobe.
kprobe_optimizer
->do_unoptimize_kprobes
->arch_unoptimize_kprobes : moved to free_list
->do_free_cleaned_kprobes
->hlist_del: the op is removed
->free_aggr_kprobe
->arch_remove_optimized_kprobe
->arch_remove_kprobe
->kfree: the op is freed
Here, if kprobes_module_callback is called and the delayed
unoptimizing probe is picked BEFORE kprobe_optimizer runs,
kprobes_module_callback
->kill_kprobe
->kill_optimized_kprobe : dequeued from unoptimizing_list <=!!!
->arch_remove_optimized_kprobe
->arch_remove_kprobe
(but op is not freed, and on the kprobe hash table)
This doesn't happen if the probe unregistration is done AFTER
kprobes_module_callback is called (because at that time the op
is gone), and kprobe-tracer does it.
To fix this bug, this patch changes kprobes_module_callback to
enqueue the op to freeing_list at kill_optimized_kprobe only
if the op is unused. The unused probes on freeing_list will
be freed in do_free_cleaned_kprobes.
Note that this calls arch_remove_*kprobe twice on the
same probe. Thus those functions have to check the double free.
Fortunately, most of arch codes already checked that except
for mips. This will be fixed in the next patch.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Timo Juhani Lindfors <timo.lindfors@iki.fi>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: systemtap@sourceware.org
Cc: yrl.pp-manager.tt@hitachi.com
Cc: David S. Miller <davem@davemloft.net>
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20130522093409.9084.63554.stgit@mhiramat-M0-7522
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 7eaeb34305 (clocksource: Provide unbind interface in sysfs)
implemented clocksource_select_fallback() which is not defined for
CONFIG_ARCH_USES_GETTIMEOFFSET=y. Add an empty inline function for
that.
Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: fengguang.wu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Read the runqueue clock through an accessor. This
prepares for adding a debugging infrastructure to
detect missing or redundant calls to update_rq_clock()
between a scheduler's entry and exit point.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In this function we are making use of rq->clock right before the
update of the rq clock, let's just call update_rq_clock() just
before that to avoid using a stale rq clock value.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365724262-20142-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
check_preempt_curr() of fair class needs an uptodate sched clock
value to update runtime stats of the current task of the target's rq.
When a task is woken up, activate_task() is usually called right before
ttwu_do_wakeup() unless the task is still in the runqueue. In the latter
case we need to update the rq clock explicitly because activate_task()
isn't here to do the job for us.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365724262-20142-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because we may update the execution time in
sched_group_set_shares()->update_cfs_shares()->reweight_entity()->update_curr()
before reweighting the entity while setting the group shares and this requires
an uptodate version of the runqueue clock.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365724262-20142-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because the sched_class::put_prev_task() callback of rt and fair
classes are referring to the rq clock to update their runtime
statistics. There is a missing rq clock update from the CPU
hotplug notifier's entry point of the scheduler.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365724262-20142-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
migration_call() will do all the things that update_runtime() does.
So let's remove it.
Furthermore, there is potential risk that the current code will catch
BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime
threads running because of enabling runtime twice while the rt_runtime
may already changed.
Signed-off-by: Neil Zhang <zhangwm@marvell.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In autogroup_create(), a tg is allocated and added to the task_groups
list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on
the list, without locking. This can race with someone walking the list,
like __enable_runtime() during CPU unplug, and result in a use-after-free
bug.
To fix this, move sched_online_group(), which adds the tg to the list,
to the end of the autogroup_create() function after the modification.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 26517f3e (tick: Avoid programming the local cpu timer if
broadcast pending) added a warning if the cpu enters broadcast mode
again while the pending bit is still set. Meelis reported that the
warning triggers. There are two corner cases which have been not
considered:
1) cpuidle calls clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
twice. That can result in the following scenario
CPU0 CPU1
cpuidle_idle_call()
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
set cpu in tick_broadcast_oneshot_mask
broadcast interrupt
event expired for cpu1
set pending bit
acpi_idle_enter_simple()
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
WARN_ON(pending bit)
Move the WARN_ON into the section where we enter broadcast mode so
it wont provide false positives on the second call.
2) safe_halt() enables interrupts, so a broadcast interrupt can be
delivered befor the broadcast mode is disabled. That sets the
pending bit for the CPU which receives the broadcast
interrupt. Though the interrupt is delivered right away from the
broadcast handler and leaves the pending bit stale.
Clear the pending bit for the current cpu in the broadcast handler.
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Cc: Len Brown <lenb@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305271841130.4220@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Unbreak architectures which do not use clockevents, but require to
build some of the core timekeeping infrastructure
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Parameters and usage of rt_mutex_adjust_prio_chain() are already
documented in Documentation/rt-mutex-design.txt. However, since this
function is called from several paths with different semantics (related
to the arguments), it is handy to have a quick reference directly in
the code.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1368608650-7935-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch moves commit 7cc23cd to the generic code:
perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
The check is now implemented in generic code instead of x86 specific
code. That way we do not have to repeat the test in each arch
supporting branch sampling.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20130521105337.GA2879@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds /sys/device/xxx/perf_event_mux_interval_ms to ajust
the multiplexing interval per PMU. The unit is milliseconds. Value has
to be >= 1.
In the 4th version, we renamed the sysfs file to be more consistent
with the other /proc/sys/kernel entries for perf_events.
In the 5th version, we handle the reprogramming of the hrtimer using
hrtimer_forward_now(). That way, we sync up to new timer value quickly
(suggested by Jiri Olsa).
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current scheme of using the timer tick was fine for per-thread
events. However, it was causing bias issues in system-wide mode
(including for uncore PMUs). Event groups would not get their fair
share of runtime on the PMU. With tickless kernels, if a core is idle
there is no timer tick, and thus no event rotation (multiplexing).
However, there are events (especially uncore events) which do count
even though cores are asleep.
This patch changes the timer source for multiplexing. It introduces a
per-PMU per-cpu hrtimer. The advantage is that even when a core goes
idle, it will come back to service the hrtimer, thus multiplexing on
system-wide events works much better.
The per-PMU implementation (suggested by PeterZ) enables adjusting the
multiplexing interval per PMU. The preferred interval is stashed into
the struct pmu. If not set, it will be forced to the default interval
value.
In order to minimize the impact of the hrtimer, it is turned on and
off on demand. When the PMU on a CPU is overcommited, the hrtimer is
activated. It is stopped when the PMU is not overcommitted.
In order for this to work properly, we had to change the order of
initialization in start_kernel() such that hrtimer_init() is run
before perf_event_init().
The default interval in milliseconds is set to a timer tick just like
with the old code. We will provide a sysctl to tune this in another
patch.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The hw breakpoint pmu 'add' function is missing the
period_left update needed for SW events.
The perf HW breakpoint events use the SW events framework
to process the overflow, so it needs to be properly initialized
in the PMU 'add' method.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1367421944-19082-5-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Kconfig symbols ARCH_INLINE_READ_UNLOCK_IRQ,
ARCH_INLINE_SPIN_UNLOCK_IRQ, and ARCH_INLINE_WRITE_UNLOCK_IRQ were added
in v2.6.33, but have never actually been used. Ingo Molnar spotted that
this is caused by three identical copy/paste erros. Eg, the Kconfig
entry for
INLINE_READ_UNLOCK_IRQ
has an (optional) dependency on:
ARCH_INLINE_READ_UNLOCK_BH
were it apparently should depend on:
ARCH_INLINE_READ_UNLOCK_IRQ
instead. Likewise for the Kconfig entries for INLINE_SPIN_UNLOCK_IRQ and
INLINE_WRITE_UNLOCK_IRQ. Fix these three errors.
This never really caused any real problems as these symbols are set (or
unset) in a group - but it's worth fixing it nevertheless.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1368780693.1350.228.camel@x61.thuisdomein
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix kernel-doc warnings in kernel/auditfilter.c:
Warning(kernel/auditfilter.c:1029): Excess function parameter 'loginuid' description in 'audit_receive_filter'
Warning(kernel/auditfilter.c:1029): Excess function parameter 'sessionid' description in 'audit_receive_filter'
Warning(kernel/auditfilter.c:1029): Excess function parameter 'sid' description in 'audit_receive_filter'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
result in event_enable_func(). After checking the return status
of try_module_get(), it returned the status of try_module_get(). But
try_module_get() returns 0 on failure, which is success for
event_enable_func().
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRnQn/AAoJEOdOSU1xswtMVDUH/3rOPX2/sbc817NN+KjXJiQi
1O8tmiOcaqMh742Df2YWSqXeM5IjARjl/xSZqazpGaDVu6HnMbEeb3Frx9hpzOPu
VEtBapasrPK6TOYSDfLaUuRsxuzSEsXR4dUexSh3o7f0/b1dY8x0BwiYxz3tz5BS
x6HX9OptUXUKDrloNC0qlX7ymuWmaeGULsTgCYYORfMe2FRFfvJhoCZFgC6dLw5x
YTubQuhVyNOD/X5jXM5h9kkUSw70VjGMhlqilyp0YLcnrhFL/QhCi7WR3b3hDwcp
MUpJyMAaPXlQHs2Q/gh46XldyhULPXamrujx8ISsDDdMQlWsWPTsQgfJ8e4/zsQ=
=n9S7
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Masami Hiramatsu fixed another bug. This time returning a proper
result in event_enable_func(). After checking the return status of
try_module_get(), it returned the status of try_module_get().
But try_module_get() returns 0 on failure, which is success for
event_enable_func()"
* tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Return -EBUSY when event_enable_func() fails to get module
This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.
The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped. Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle. Comments are updated
accordingly.
This makes the iterators easier to use and will simplify controllers.
Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state. Make it static.
While at it, make it return bool.
Signed-off-by: Tejun Heo <tj@kernel.org>
Merging to receive 7805d000db ("cgroup: fix a subtle bug in descendant
pre-order walk") so that further iterator updates can build upon it.
Signed-off-by: Tejun Heo <tj@kernel.org>
When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty. This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.
There's no reason to have the early exit path. Remove it along with
the later assumption that the subtree isn't empty. This simplifies
the code a bit and fixes the subtle bug.
While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
If ftrace=<tracer> is on the kernel command line, when that tracer is
registered, it will be initiated by tracing_set_tracer() to execute that
tracer.
The nop tracer is just a stub tracer that is used to have no tracer
enabled. It is assigned at early bootup as it is the default tracer.
But if ftrace=nop is on the kernel command line, the registering of the
nop tracer will call tracing_set_tracer() which will try to execute
the nop tracer. But it expects tr->current_trace to be assigned something
as it usually is assigned to the nop tracer. As it hasn't been assigned
to anything yet, it causes the system to crash.
The simple fix is to move the tr->current_trace = nop before registering
the nop tracer. The functionality is still the same as the nop tracer
doesn't do anything anyway.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
to avoid false positives (previously it was only scanning specific
sections and missing .ref.data).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAABAgAGBQJRlnWwAAoJEGvWsS0AyF7x5EEP/Rb1yjDKgw6RdBdHCAa4CokX
gmOwTW2PVLuNqxu3JUE+GQHTAkZnLHFLzT4uSI4CAs0YdrdI+aCnJvy2jK6vZ0WC
bD66hMxOoiJ4sZSA+mU19Zjwb1pRFWi9+sOrYgrC+AbYd45Y6psshn0kog+HslXa
y2fv/VKqfSUMRJ+lB2p6jXcwxB1bFm4jcYM8OleKhdbb7QUkAZjftpg83hkTSq3n
+eHQZxWTaeVubwFDmRQf2nPkixNrSI0ZTbOKgHUBJLvNAsxY2/eE3cvJY17NbfdH
Hq9o7FmWPyRYrVHUzo5S0LbFev3tGUxLzc53G8DfajXRQbNwtAVhkpfX9vV2toH4
ze/3dIMboUC+yRR9oH0pdRVwndq8oPtWAAKxfOrXKcm+jue2obtQDuswvEmtaufF
ez10vF02doPZgjDeXKZY6hO2LeyjSh82opk4oSMmgsBjTBlsXxrelNAbMxHIiSnx
SClCJUm+0PcAhxyehmOb2N95CmGi0sZd2Nwo0QAOBK/B3gyWxtz5qGQ0iADT5wsH
fI2KLduzyEKNv5phF5Ct8BdA3p89J64/K9HDnV5dA1W8aRudE8fdEpOHlIb3mbn5
NsMg4ahhKPOJM1IX+YBSCXxJapgBAumlXwfXzQi3CzUF0iBmRS2enybbLtR/fxU6
5+o92idKeq9NQwhmGH7v
=LeOH
-----END PGP SIGNATURE-----
Merge tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64
Pull kmemleak patches from Catalin Marinas:
"Kmemleak now scans all the writable and non-executable module sections
to avoid false positives (previously it was only scanning specific
sections and missing .ref.data)."
* tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64:
kmemleak: No need for scanning specific module sections
kmemleak: Scan all allocated, writeable and not executable module sections
Christian found v3.9 does not work with E350 with EFI is enabled.
[ 1.658832] Trying to unpack rootfs image as initramfs...
[ 1.679935] BUG: unable to handle kernel paging request at ffff88006e3fd000
[ 1.686940] IP: [<ffffffff813661df>] memset+0x1f/0xb0
[ 1.692010] PGD 1f77067 PUD 1f7a067 PMD 61420067 PTE 0
but early memtest report all memory could be accessed without problem.
early page table is set in following sequence:
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[ 0.000000] init_memory_mapping: [mem 0x6e600000-0x6e7fffff]
[ 0.000000] init_memory_mapping: [mem 0x6c000000-0x6e5fffff]
[ 0.000000] init_memory_mapping: [mem 0x00100000-0x6bffffff]
[ 0.000000] init_memory_mapping: [mem 0x6e800000-0x6ea07fff]
but later efi_enter_virtual_mode try set mapping again wrongly.
[ 0.010644] pid_max: default: 32768 minimum: 301
[ 0.015302] init_memory_mapping: [mem 0x640c5000-0x6e3fcfff]
that means it fails with pfn_range_is_mapped.
It turns out that we have a bug in add_range_with_merge and it does not
merge range properly when new add one fill the hole between two exsiting
ranges. In the case when [mem 0x00100000-0x6bffffff] is the hole between
[mem 0x00000000-0x000fffff] and [mem 0x6c000000-0x6e7fffff].
Fix the add_range_with_merge by calling itself recursively.
Reported-by: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQVofGoSk7q5-0irjkBxemqK729cND4hov-1QCBJDhxpgQ@mail.gmail.com
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As kmemleak now scans all module sections that are allocated, writable
and non executable, there's no need to scan individual sections that
might reference data.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Instead of just picking data sections by name (names that start
with .data, .bss or .ref.data), use the section flags and scan all
sections that are allocated, writable and not executable. Which should
cover all sections of a module that might reference data.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[catalin.marinas@arm.com: removed unused 'name' variable]
[catalin.marinas@arm.com: collapsed 'if' blocks]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Pull workqueue fixes from Tejun Heo:
"Three more workqueue regression fixes.
- Fix unbalanced unlock in trylock failure path of manage_workers().
This shouldn't happen often in the wild but is possible.
- While making schedule_work() and friends inline, they become
unavailable to !GPL modules. Allow !GPL modules to access basic
stuff - system_wq and queue_*work_on() - so that schedule_work()
and friends can be used.
- During boot, the unbound NUMA support code allocates a cpumask for
each possible node using alloc_cpumask_var_node(), which ends up
trying to allocate node-specific memory even for offline nodes
triggering BUG in the memory alloc code. Use NUMA_NO_NODE for
offline nodes."
* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init()
workqueue: Make schedule_work() available again to non GPL modules
workqueue: correct handling of the pool spin_lock
Pull RCU fixes from Paul McKenney:
"A couple of fixes for RCU regressions:
- A boneheaded boolean-logic bug that resulted in excessive delays on
boot, hibernation and suspend that was reported by Borislav Petkov,
Bjørn Mork, and Joerg Roedel. The fix inserts a single "!".
- A fix for a boot-time splat due to allocating from bootmem too late
in boot, fix courtesy of Sasha Levin with additional help from
Yinghai Lu."
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Don't allocate bootmem from rcu_init()
rcu: Fix comparison sense in rcu_needs_cpu()
argv_split(empty_or_all_spaces) happily succeeds, it simply returns
argc == 0 and argv[0] == NULL. Change call_usermodehelper_exec() to
check sub_info->path != NULL to avoid the crash.
This is the minimal fix, todo:
- perhaps we should change argv_split() to return NULL or change the
callers.
- kill or justify ->path[0] check
- narrow the scope of helper_lock()
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-By: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since try_module_get() returns false( = 0) when it fails to
pindown a module, event_enable_func() returns 0 which means
"succeed". This can cause a kernel panic when the entry
is removed, because the event is already released.
This fixes the bug by returning -EBUSY, because the reason
why it fails is that the module is being removed at that time.
Link: http://lkml.kernel.org/r/20130516114848.13508.97899.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Provide a sysfs interface to allow unbinding of clockevent
devices. The device is unbound if it is unused or if there is a
replacement device available. Unbinding of broadcast devices is not
supported as we don't want to foster that nonsense. If no replacement
device is available the unbind returns -EBUSY. Unbind is available
from the kernel and through sysfs, which is necessary to drop the
module refcount.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.499216659@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Split out the clockevent device selection logic. Preparatory patch to
allow unbinding active clockevent devices.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.431796247@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Provide a simple sysfs interface for the clockevent devices. Show the
current active clockevent device.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.371634778@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We want to be able to remove clockevent modules as well. Add a
refcount so we don't remove a module with an active clock event
device.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.307435149@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No need to call another function and have duplicated cases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.235746557@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that the notifier chain is gone there are no other users and it's
pointless to nest tick_device_lock inside of clockevents_lock because
there is no other use case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.162888472@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7+ years and still a single user. Kill it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.098520211@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The unregister call can fail, if the clocksource is the current one
and there is no replacement clocksource available. It can also fail,
if the clocksource is the watchdog clocksource and I'm not going to
provide support for this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143436.029915527@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the module refcount held for the current clocksource there is no
way to unload the module.
Provide a sysfs interface which allows to unbind the clocksource. One
could argue that the clocksource override could be (ab)used to do so,
but the clocksource override cannot be used from the kernel itself,
while an unbind function can be used to programmatically check whether
a clocksource can be shutdown or not.
The unbind functionality uses the new skip current feature of
clocksource_select and verifies that a fallback clocksource has been
installed. If the clocksource which should be unbound is the current
clocksource and no fallback can be found, unbind returns -EBUSY.
This does not support the unbinding of a clocksource which is used as
the watchdog clocksource. No point in fostering crappy hardware.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143435.964218245@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Split out the user string input for clocksource override. Preparatory
patch for unbind.
[ jstultz: Fix an off by one error ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143435.895851338@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Preparatory patch for clocksource unbind support.
Split out code from clocksource_select and modify it, so it skips the
current clocksource on request and tries to find a fallback
clocksource. Convert all existing users. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143435.834965397@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a module refcount, so the current clocksource cannot be removed
unconditionally.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143435.762417789@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
timekeeping_notify() can fail due cs->enable() failure. Though the
caller does not notice and happily keeps the wrong clocksource as the
current one.
Let the caller know about failure, so the current clocksource will be
shown correctly in sysfs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143435.696321912@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If a clocksource has a (wrong) high rating, but can't be used as a
timebase for oneshot tick mode, it is unconditionally selected even
when the system is already in oneshot tick mode. This causes full
system failure.
Verify the clocksource selection against the oneshot mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130425143435.635040849@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Also, Masami Hiramatsu fixed up some minor bugs that were discovered
by sparse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRk+PYAAoJEOdOSU1xswtMgO4H/AlePQ4IEOfXgy43Xx8S7Uew
FmvHYmUi5l6UuFRLxcVZBsnSqo3i43UGyj9Lm+g1wBwNP3IDEf+t+fVaN4UP18KW
C+5OoEWyJLe4BlQoVsdBV1+lmivacG3uczvAPY6fibyTqbJN67uzufPLw8ruBMOo
dIIXWUR1sKRyy9+SF23q0rwf5ZfuavlMnmeTZ6omk0evbVT2q0lbpUeN/V1Rrxmc
ZUObTyoFvVMvPYnwutGh7+QB/o0LR9C6MPyyFTvz/o9bD9CDXdbTvSQatoYrbEm8
9/0hpetm+KJpa2M9mf4djeXWCIFX3gBuQ156LEFhS68Ug+HDWQ7Yv9iJQmmR6OE=
=ZToX
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This includes a fix to a memory leak when adding filters to traces.
Also, Masami Hiramatsu fixed up some minor bugs that were discovered
by sparse."
* tag 'trace-fixes-v3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Make print_*probe_event static
tracing/kprobes: Fix a sparse warning for incorrect type in assignment
tracing/kprobes: Use rcu_dereference_raw for tp->files
tracing: Fix leaks of filter preds
Pull perf fixes from Thomas Gleixner:
- Fix for a task exit cleanup race caused by a missing a preempt
disable
- Cleanup of the event notification functions with a massive reduction
of duplicated code
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Factor out auxiliary events notification
perf: Fix EXIT event notification
Pull timer fixes from Thomas Gleixner:
- Cure for not using zalloc in the first place, which leads to random
crashes with CPUMASK_OFF_STACK.
- Revert a user space visible change which broke udev
- Add a missing cpu_online early return introduced by the new full
dyntick conversions
- Plug a long standing race in the timer wheel cpu hotplug code.
Sigh...
- Cleanup NOHZ per cpu data on cpu down to prevent stale data on cpu
up.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons
timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE
tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline
tick: Cleanup NOHZ per cpu data on cpu down
tick: Use zalloc_cpumask_var for allocating offstack cpumasks
Pull core fixes from Thomas Gleixner:
- Two fixlets for the fallout of the generic idle task conversion
- Documentation update
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu/idle: Wrap cpu-idle poll mode within rcu_idle_enter/exit
idle: Fix hlt/nohlt command-line handling in new generic idle
kthread: Document ways of reducing OS jitter due to per-CPU kthreads
According to sparse warning, print_*probe_event static because
those functions are not directly called from outside.
Link: http://lkml.kernel.org/r/20130513115839.6545.83067.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use rcu_dereference_raw() for accessing tp->files. Because the
write-side uses rcu_assign_pointer() for memory barrier,
the read-side also has to use rcu_dereference_raw() with
read memory barrier.
Link: http://lkml.kernel.org/r/20130513115834.6545.17022.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Special preds are created when folding a series of preds that
can be done in serial. These are allocated in an ops field of
the pred structure. But they were never freed, causing memory
leaks.
This was discovered using the kmemleak checker:
unreferenced object 0xffff8800797fd5e0 (size 32):
comm "swapper/0", pid 1, jiffies 4294690605 (age 104.608s)
hex dump (first 32 bytes):
00 00 01 00 03 00 05 00 07 00 09 00 0b 00 0d 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff814b52af>] kmemleak_alloc+0x73/0x98
[<ffffffff8111ff84>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
[<ffffffff81120e68>] __kmalloc+0xd7/0x125
[<ffffffff810d47eb>] kcalloc.constprop.24+0x2d/0x2f
[<ffffffff810d4896>] fold_pred_tree_cb+0xa9/0xf4
[<ffffffff810d3781>] walk_pred_tree+0x47/0xcc
[<ffffffff810d5030>] replace_preds.isra.20+0x6f8/0x72f
[<ffffffff810d50b5>] create_filter+0x4e/0x8b
[<ffffffff81b1c30d>] ftrace_test_event_filter+0x5a/0x155
[<ffffffff8100028d>] do_one_initcall+0xa0/0x137
[<ffffffff81afbedf>] kernel_init_freeable+0x14d/0x1dc
[<ffffffff814b24b7>] kernel_init+0xe/0xdb
[<ffffffff814d539c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When rcu_init() is called we already have slab working, allocating
bootmem at that point results in warnings and an allocation from
slab. This commit therefore changes alloc_bootmem_cpumask_var() to
alloc_cpumask_var() in rcu_bootup_announce_oddness(), which is called
from rcu_init().
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Robin Holt <holt@sgi.com>
[paulmck: convert to zalloc_cpumask_var(), as suggested by Yinghai Lu.]
Add wait_on_atomic_t() and wake_up_atomic_t() to indicate became-zero events on
atomic_t types. This uses the bit-wake waitqueue table. The key is set to a
value outside of the number of bits in a long so that wait_on_bit() won't be
woken up accidentally.
What I'm using this for is: in a following patch I add a counter to struct
fscache_cookie to count the number of outstanding operations that need access
to netfs data. The way this works is:
(1) When a cookie is allocated, the counter is initialised to 1.
(2) When an operation wants to access netfs data, it calls atomic_inc_unless()
to increment the counter before it does so. If it was 0, then the counter
isn't incremented, the operation isn't permitted to access the netfs data
(which might by this point no longer exist) and the operation aborts in
some appropriate manner.
(3) When an operation finishes with the netfs data, it decrements the counter
and if it reaches 0, calls wake_up_atomic_t() on it - the assumption being
that it was the last blocker.
(4) When a cookie is released, the counter is decremented and the releaser
uses wait_on_atomic_t() to wait for the counter to become 0 - which should
indicate no one is using the netfs data any longer. The netfs data can
then be destroyed.
There are some alternatives that I have thought of and that have been suggested
by Tejun Heo:
(A) Using wait_on_bit() to wait on a bit in the counter. This doesn't work
because if that bit happens to be 0 then the wait won't happen - even if
the counter is non-zero.
(B) Using wait_on_bit() to wait on a flag elsewhere which is cleared when the
counter reaches 0. Such a flag would be redundant and would add
complexity.
(C) Adding a waitqueue to fscache_cookie - this would expand that struct by
several words for an event that happens just once in each cookie's
lifetime. Further, cookies are generally per-file so there are likely to
be a lot of them.
(D) Similar to (C), but add a pointer to a waitqueue in the cookie instead of
a waitqueue. This would add single word per cookie and so would be less
of an expansion - but still an expansion.
(E) Adding a static waitqueue to the fscache module. Generally this would be
fine, but under certain circumstances many cookies will all get added at
the same time (eg. NFS umount, cache withdrawal) thereby presenting
scaling issues. Note that the wait may be significant as disk I/O may be
in progress.
So, I think reusing the wait_on_bit() waitqueue set is reasonable. I don't
make much use of the waitqueue I need on a per-cookie basis, but sometimes I
have a huge flood of the cookies to deal with.
I also don't want to add a whole new set of global waitqueue tables
specifically for the dec-to-0 event if I can reuse the bit tables.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Kay Sievers noted that the ALWAYS_USE_PERSISTENT_CLOCK config,
which enables some minor compile time optimization to avoid
uncessary code in mostly the suspend/resume path could cause
problems for userland.
In particular, the dependency for RTC_HCTOSYS on
!ALWAYS_USE_PERSISTENT_CLOCK, which avoids setting the time
twice and simplifies suspend/resume, has the side effect
of causing the /sys/class/rtc/rtcN/hctosys flag to always be
zero, and this flag is commonly used by udev to setup the
/dev/rtc symlink to /dev/rtcN, which can cause pain for
older applications.
While the udev rules could use some work to be less fragile,
breaking userland should strongly be avoided. Additionally
the compile time optimizations are fairly minor, and the code
being optimized is likely to be reworked in the future, so
lets revert this change.
Reported-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: stable <stable@vger.kernel.org> #3.9
Cc: Feng Tang <feng.tang@intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Link: http://lkml.kernel.org/r/1366828376-18124-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 8425e3d5bd ("workqueue: inline trivial wrappers") changed
schedule_work() and schedule_delayed_work() to inline wrappers,
but these rely on some symbols that are EXPORT_SYMBOL_GPL, while
the original functions were EXPORT_SYMBOL. This has the effect of
changing the licensing requirement for these functions and making
them unavailable to non GPL modules.
Make them available again by removing the restriction on the
required symbols.
Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we fail to mutex_trylock(), we release the pool spin_lock and do
mutex_lock(). After that, we should regrab the pool spin_lock, but,
regrabbing is missed in current code. So correct it.
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kdbus folks want a sane way to determine the cgroup path that a given
task belongs to on a given hierarchy, which is a reasonble thing to
expect from cgroup core.
Implement task_cgroup_path_from_hierarchy().
v2: Dropped unnecessary NULL check on the return value of
task_cgroup_from_root() as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <greg@kroah.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <daniel@zonque.org>
We want to be able to lookup a hierarchy from its id and cyclic
allocation is a whole lot simpler with idr. Convert to idr and use
idr_alloc_cyclc().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that hierarchy_id alloc / free are protected by the cgroup
mutexes, there's no need for this separate lock. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
We're planning to converting hierarchy_ida to an idr and use it to
look up hierarchy from its id. As we want the mapping to happen
atomically with cgroupfs_root registration, this patch refactors
hierarchy_id init / exit so that ida operations happen inside
cgroup_[root_]mutex.
* s/init_root_id()/cgroup_init_root_id()/ and make it return 0 or
-errno like a normal function.
* Move hierarchy_id initialization from cgroup_root_from_opts() into
cgroup_mount() block where the root is confirmed to be used and
being registered while holding both mutexes.
* Split cgroup_drop_id() into cgroup_exit_root_id() and
cgroup_free_root(), so that ID release can happen before dropping
the mutexes in cgroup_kill_sb(). The latter expects hierarchy_id to
be exited before being invoked.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
callbacks) introduced a bug that can result in excessively long grace
periods. This bug reverse the senes of the "if" statement checking
for lazy callbacks, so that RCU takes a lazy approach when there are
in fact non-lazy callbacks. This can result in excessive boot, suspend,
and resume times.
This commit therefore fixes the sense of this "if" statement.
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Bjørn Mork <bjorn@mork.no>
Reported-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Bjørn Mork <bjorn@mork.no>
Tested-by: Joerg Roedel <joro@8bytes.org>
This patch adds system wide workqueues aligned towards power saving. This is
done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to
'true'.
tj: updated comments a bit.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueues can be performance or power-oriented. Currently, most workqueues are
bound to the CPU they were created on. This gives good performance (due to cache
effects) at the cost of potentially waking up otherwise idle cores (Idle from
scheduler's perspective. Which may or may not be physically idle) just to
process some work. To save power, we can allow the work to be rescheduled on a
core that is already awake.
Workqueues created with the WQ_UNBOUND flag will allow some power savings.
However, we don't change the default behaviour of the system. To enable
power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to
be turned on. This option can also be overridden by the
workqueue.power_efficient boot parameter.
tj: Updated config description and comments. Renamed
CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull workqueue fix from Tejun Heo:
"A fix for a workqueue_congested() regression that broke fscache"
* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number
An inactive timer's base can refer to a offline cpu's base.
In the current code, cpu_base's lock is blindly reinitialized each
time a CPU is brought up. If a CPU is brought online during the period
that another thread is trying to modify an inactive timer on that CPU
with holding its timer base lock, then the lock will be reinitialized
under its feet. This leads to following SPIN_BUG().
<0> BUG: spinlock already unlocked on CPU#3, kworker/u:3/1466
<0> lock: 0xe3ebe000, .magic: dead4ead, .owner: kworker/u:3/1466, .owner_cpu: 1
<4> [<c0013dc4>] (unwind_backtrace+0x0/0x11c) from [<c026e794>] (do_raw_spin_unlock+0x40/0xcc)
<4> [<c026e794>] (do_raw_spin_unlock+0x40/0xcc) from [<c076c160>] (_raw_spin_unlock+0x8/0x30)
<4> [<c076c160>] (_raw_spin_unlock+0x8/0x30) from [<c009b858>] (mod_timer+0x294/0x310)
<4> [<c009b858>] (mod_timer+0x294/0x310) from [<c00a5e04>] (queue_delayed_work_on+0x104/0x120)
<4> [<c00a5e04>] (queue_delayed_work_on+0x104/0x120) from [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c)
<4> [<c04eae00>] (sdhci_msm_bus_voting+0x88/0x9c) from [<c04d8780>] (sdhci_disable+0x40/0x48)
<4> [<c04d8780>] (sdhci_disable+0x40/0x48) from [<c04bf300>] (mmc_release_host+0x4c/0xb0)
<4> [<c04bf300>] (mmc_release_host+0x4c/0xb0) from [<c04c7aac>] (mmc_sd_detect+0x90/0xfc)
<4> [<c04c7aac>] (mmc_sd_detect+0x90/0xfc) from [<c04c2504>] (mmc_rescan+0x7c/0x2c4)
<4> [<c04c2504>] (mmc_rescan+0x7c/0x2c4) from [<c00a6a7c>] (process_one_work+0x27c/0x484)
<4> [<c00a6a7c>] (process_one_work+0x27c/0x484) from [<c00a6e94>] (worker_thread+0x210/0x3b0)
<4> [<c00a6e94>] (worker_thread+0x210/0x3b0) from [<c00aad9c>] (kthread+0x80/0x8c)
<4> [<c00aad9c>] (kthread+0x80/0x8c) from [<c000ea80>] (kernel_thread_exit+0x0/0x8)
As an example, this particular crash occurred when CPU #3 is executing
mod_timer() on an inactive timer whose base is refered to offlined CPU
#2. The code locked the timer_base corresponding to CPU #2. Before it
could proceed, CPU #2 came online and reinitialized the spinlock
corresponding to its base. Thus now CPU #3 held a lock which was
reinitialized. When CPU #3 finally ended up unlocking the old cpu_base
corresponding to CPU #2, we hit the above SPIN_BUG().
CPU #0 CPU #3 CPU #2
------ ------- -------
..... ...... <Offline>
mod_timer()
lock_timer_base
spin_lock_irqsave(&base->lock)
cpu_up(2) ..... ......
init_timers_cpu()
.... ..... spin_lock_init(&base->lock)
..... spin_unlock_irqrestore(&base->lock) ......
<spin_bug>
Allocation of per_cpu timer vector bases is done only once under
"tvec_base_done[]" check. In the current code, spinlock_initialization
of base->lock isn't under this check. When a CPU is up each time the
base lock is reinitialized. Move base spinlock initialization under
the check.
Signed-off-by: Tirupathi Reddy <tirupath@codeaurora.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1368520142-4136-1-git-send-email-tirupath@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bjørn Mork reported the following warning when running powertop.
[ 49.289034] ------------[ cut here ]------------
[ 49.289055] WARNING: at kernel/rcutree.c:502 rcu_eqs_exit_common.isra.48+0x3d/0x125()
[ 49.289244] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-bisect-rcu-warn+ #107
[ 49.289251] ffffffff8157d8c8 ffffffff81801e28 ffffffff8137e4e3 ffffffff81801e68
[ 49.289260] ffffffff8103094f ffffffff81801e68 0000000000000000 ffff88023afcd9b0
[ 49.289268] 0000000000000000 0140000000000000 ffff88023bee7700 ffffffff81801e78
[ 49.289276] Call Trace:
[ 49.289285] [<ffffffff8137e4e3>] dump_stack+0x19/0x1b
[ 49.289293] [<ffffffff8103094f>] warn_slowpath_common+0x62/0x7b
[ 49.289300] [<ffffffff8103097d>] warn_slowpath_null+0x15/0x17
[ 49.289306] [<ffffffff810a9006>] rcu_eqs_exit_common.isra.48+0x3d/0x125
[ 49.289314] [<ffffffff81079b49>] ? trace_hardirqs_off_caller+0x37/0xa6
[ 49.289320] [<ffffffff810a9692>] rcu_idle_exit+0x85/0xa8
[ 49.289327] [<ffffffff8107076e>] trace_cpu_idle_rcuidle+0xae/0xff
[ 49.289334] [<ffffffff810708b1>] cpu_startup_entry+0x72/0x115
[ 49.289341] [<ffffffff813689e5>] rest_init+0x149/0x150
[ 49.289347] [<ffffffff8136889c>] ? csum_partial_copy_generic+0x16c/0x16c
[ 49.289355] [<ffffffff81a82d34>] start_kernel+0x3f0/0x3fd
[ 49.289362] [<ffffffff81a8274c>] ? repair_env_string+0x5a/0x5a
[ 49.289368] [<ffffffff81a82481>] x86_64_start_reservations+0x2a/0x2c
[ 49.289375] [<ffffffff81a82550>] x86_64_start_kernel+0xcd/0xd1
[ 49.289379] ---[ end trace 07a1cc95e29e9036 ]---
The warning is that 'rdtp->dynticks' has an unexpected value, which roughly
translates to - the calls to rcu_idle_enter() and rcu_idle_exit() were not
made in the correct order, or otherwise messed up.
And Bjørn's painstaking debugging indicated that this happens when the idle
loop enters the poll mode. Looking at the poll function cpu_idle_poll(), and
the implementation of trace_cpu_idle_rcuidle(), the problem becomes very clear:
cpu_idle_poll() lacks calls to rcu_idle_enter/exit(), and trace_cpu_idle_rcuidle()
calls them in the reverse order - first rcu_idle_exit(), and then rcu_idle_enter().
Hence the even/odd alternative sequencing of rdtp->dynticks goes for a toss.
And powertop readily triggers this because powertop uses the idle-tracing
infrastructure extensively.
So, to fix this, wrap the code in cpu_idle_poll() within rcu_idle_enter/exit(),
so that it blends properly with the calls inside trace_cpu_idle_rcuidle() and
thus get the function ordering right.
Reported-and-tested-by: Bjørn Mork <bjorn@mork.no>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/519169BF.4080208@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
commit 5b39939a4 (nohz: Move ts->idle_calls incrementation into strict
idle logic) moved code out of tick_nohz_stop_sched_tick() and missed
to bail out when the cpu is offline. That's causing subsequent
failures as an offline CPU is supposed to die and not to fiddle with
nohz magic.
Return false in can_stop_idle_tick() if the cpu is offline.
Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305132138160.2863@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cgroup_create_file() calls d_instantiate(), which may decide to look
at the xattrs on the file. Smack always does this and SELinux can be
configured to do so.
But cgroup_add_file() didn't initialize xattrs before calling
cgroup_create_file(), which finally leads to dereferencing NULL
dentry->d_fsdata.
This bug has been there since cgroup xattr was introduced.
Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Ivan Bulatovic <combuster@archlinux.us>
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Avoid waking up every thread sleeping in a sigtimedwait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid waking up every thread sleeping in a nanosleep call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Android goes through suspend/resume very often (every few seconds when
on a busy wifi network with the screen off), and a significant portion
of the energy used to go in and out of suspend is spent in the
freezer. If a task has called freezer_do_not_count(), don't bother
waking it up. If it happens to wake up later it will call
freezer_count() and immediately enter the refrigerator.
Combined with patches to convert freezable helpers to use
freezer_do_not_count() and convert common sites where idle userspace
tasks are blocked to use the freezable helpers, this reduces the
time and energy required to suspend and resume.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All tasks can easily be frozen in under 10 ms, switch to using
an initial 1 ms sleep followed by exponential backoff until
8 ms. Also convert the printed time to ms instead of centiseconds.
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The only existing caller to debug_check_no_locks_held calls it
with 'current' as the task, and the freezer needs to call
debug_check_no_locks_held but doesn't already have a current
task pointer, so remove the argument. It is already assuming
that the current task is relevant by dumping the current stack
trace as part of the warning.
This was originally part of 6aa9707099 (lockdep: check that
no locks held at freeze time) which was reverted in
dbf520a9d7.
Original-author: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Prarit reported a crash on CPU offline/online. The reason is that on
CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
up. If at cpu online an interrupt happens before the per cpu tick
device is registered the irq_enter() check potentially sees stale data
and dereferences a NULL pointer.
Cleanup the data after the cpu is dead.
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Cc: Mike Galbraith <bitbucket@online.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in his
patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring buffer
but not tracing.
I based this off of the last patch you took from me that fixed the merge
conflict error, as that was the commit that had all the changes I needed
for this set of changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRjYnJAAoJEOdOSU1xswtMg9EH/iFs438FgrNMk2ZdQftmqcqA
cqcactHo1mmoHjAoLZT/oDBjEThhVUuqzMXrFRutSYcTh4PsQEC3arX0mpsC+T12
UEEV/tZS3TXH+GXEyrOit/O3kzntQcDHwJDV4+0n80IrJmw4IDZbnV3R8DWjS6wp
so+dq0A1pwehcG/upgpw1oTKsGv1G/p6vyf968B6W44icHEClLiph4JE2kzE6D3r
fzSpOLaQoBEvwIRf6xRKxi240VqIItXwfG7pwNpPpSC37gRLzm74zGr+Sj93/k1y
pARbZ/5XO7/pcVYQYupErRAoV5in+QMZ67k5G1vQIvyOS9r039catbQf/7PkzcI=
=EZCE
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing/kprobes update from Steven Rostedt:
"The majority of these changes are from Masami Hiramatsu bringing
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in
his patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring
buffer but not tracing.
I based this off of the last patch you took from me that fixed the
merge conflict error, as that was the commit that had all the changes
I needed for this set of changes."
* tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Support soft-mode disabling
tracing/kprobes: Support ftrace_event_file base multibuffer
tracing/kprobes: Pass trace_probe directly from dispatcher
tracing/kprobes: Increment probe hit-count even if it is used by perf
tracing/kprobes: Use bool for retprobe checker
ftrace: Fix function probe when more than one probe is added
ftrace: Fix the output of enabled_functions debug file
ftrace: Fix locking in register_ftrace_function_probe()
tracing: Add helper function trace_create_new_event() to remove duplicate code
tracing: Modify soft-mode only if there's no other referrer
tracing: Indicate enabled soft-mode in enable file
tracing/kprobes: Fix to increment return event probe hit-count
ftrace: Cleanup regex_lock and ftrace_lock around hash updating
ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
ftrace: Have ftrace_regex_write() return either read or error
tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
tracing: Don't succeed if event_enable_func did not register anything
ring-buffer: Select IRQ_WORK
Pull audit changes from Eric Paris:
"Al used to send pull requests every couple of years but he told me to
just start pushing them to you directly.
Our touching outside of core audit code is pretty straight forward. A
couple of interface changes which hit net/. A simple argument bug
calling audit functions in namei.c and the removal of some assembly
branch prediction code on ppc"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: fix message spacing printing auid
Revert "audit: move kaudit thread start from auditd registration to kaudit init"
audit: vfs: fix audit_inode call in O_CREAT case of do_last
audit: Make testing for a valid loginuid explicit.
audit: fix event coverage of AUDIT_ANOM_LINK
audit: use spin_lock in audit_receive_msg to process tty logging
audit: do not needlessly take a lock in tty_audit_exit
audit: do not needlessly take a spinlock in copy_signal
audit: add an option to control logging of passwords with pam_tty_audit
audit: use spin_lock_irqsave/restore in audit tty code
helper for some session id stuff
audit: use a consistent audit helper to log lsm information
audit: push loginuid and sessionid processing down
audit: stop pushing loginid, uid, sessionid as arguments
audit: remove the old depricated kernel interface
audit: make validity checking generic
audit: allow checking the type of audit message in the user filter
audit: fix build break when AUDIT_DEBUG == 2
audit: remove duplicate export of audit_enabled
Audit: do not print error when LSMs disabled
...
Pull stray syscall bits from Al Viro:
"Several syscall-related commits that were missing from the original"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
unicore32: just use mmap_pgoff()...
unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)
- Two krealloc() abuse fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iEYEABECAAYFAlGLydsACgkQdwG7hYl686O3ZwCfYIJUdE/VADo53j3SwDbCEa15
pMgAnA13t6dKgC74Xqk5dd65KKaw64WB
=nbL9
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20130509' of git://git.infradead.org/~dwmw2/random-2.6
Pull misc fixes from David Woodhouse:
"This is some miscellaneous cleanups that don't really belong anywhere
else (or were ignored), that have been sitting in linux-next for some
time. Two of them are fixes resulting from my audit of krealloc()
usage that don't seem to have elicited any response when I posted
them, and the other three are patches from Artem removing dead code."
* tag 'for-linus-20130509' of git://git.infradead.org/~dwmw2/random-2.6:
pcmcia: remove RPX board stuff
m68k: remove rpxlite stuff
pcmcia: remove Motorola MBX860 support
params: Fix potential memory leak in add_sysfs_param()
dell-laptop: Fix krealloc() misuse in parse_da_table()
It is a few instructions more efficent to and slightly more
readable to use this_rq()-> instead of cpu_rq(smp_processor_id())-> .
Size comparison of kernel/sched/fair.o:
text data bss dec hex filename
27972 122 26 28120 6dd8 fair.o.before
27956 122 26 28104 6dc8 fair.o.after
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1368116643-87971-1-git-send-email-nzimmer@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Support soft-mode disabling on kprobe-based dynamic events.
Soft-disabling is just ignoring recording if the soft disabled
flag is set.
Link: http://lkml.kernel.org/r/20130509054454.30398.7237.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Increment probe hit-count for profiling even if it is used
by perf tool. Same thing has already done in trace_uprobe.
Link: http://lkml.kernel.org/r/20130509054436.30398.21133.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enabled_functions debugfs file was created to be able to see
what functions have been modified from nops to calling a tracer.
The current method uses the counter in the function record.
As when a ftrace_ops is registered to a function, its count
increases. But that doesn't mean that the function is actively
being traced. /proc/sys/kernel/ftrace_enabled can be set to zero
which would disable it, as well as something can go wrong and
we can think its enabled when only the counter is set.
The record's FTRACE_FL_ENABLED flag is set or cleared when its
function is modified. That is a much more accurate way of knowing
what function is enabled or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iteration of the ftrace function list and the call to
ftrace_match_record() need to be protected by the ftrace_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both __trace_add_new_event() and __trace_early_add_new_event() do
basically the same thing, except that __trace_add_new_event() does
a little more.
Instead of having duplicate code between the two functions, add
a helper function trace_create_new_event() that both can use.
This will help against having bugs fixed in one function but not
the other.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Modify soft-mode flag only if no other soft-mode referrer
(currently only the ftrace triggers) by using a reference
counter in each ftrace_event_file.
Without this fix, adding and removing several different
enable/disable_event triggers on the same event clear
soft-mode bit from the ftrace_event_file. This also
happens with a typo of glob on setting triggers.
e.g.
# echo vfs_symlink:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0*
# echo typo_func:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0
# cat set_ftrace_filter
#### all functions enabled ####
vfs_symlink:enable_event:net:netif_rx:unlimited
As above, we still have a trigger, but soft-mode is gone.
Link: http://lkml.kernel.org/r/20130509054429.30398.7464.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Indicate enabled soft-mode event as "1*" in "enable" file
for each event, because it can be soft-disabled when disable_event
trigger is hit.
Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cleanup regex_lock and ftrace_lock locking points around
ftrace_ops hash update code.
The new rule is that regex_lock protects ops->*_hash
read-update-write code for each ftrace_ops. Usually,
hash update is done by following sequence.
1. allocate a new local hash and copy the original hash.
2. update the local hash.
3. move(actually, copy) back the local hash to ftrace_ops.
4. update ftrace entries if needed.
5. release the local hash.
This makes regex_lock protect #1-#4, and ftrace_lock
to protect #3, #4 and adding and removing ftrace_ops from the
ftrace_ops_list. The ftrace_lock protects #3 as well because
the move functions update the entries too.
Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix a deadlock on ftrace_regex_lock which happens when setting
an enable_event trigger on dynamic kprobe event as below.
----
sh-2.05b# echo p vfs_symlink > kprobe_events
sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #35 Not tainted
---------------------------------------------
sh/72 is trying to acquire lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0
but task is already holding lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(ftrace_regex_lock);
lock(ftrace_regex_lock);
*** DEADLOCK ***
----
To fix that, this introduces a finer regex_lock for each ftrace_ops.
ftrace_regex_lock is too big of a lock which protects all
filter/notrace_hash operations, but it doesn't need to be a global
lock after supporting multiple ftrace_ops because each ftrace_ops
has its own filter/notrace_hash.
Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added initialization flag and automate mutex initialization for
non ftrace.c ftrace_probes. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_regex_write() reads the result of ftrace_process_regex()
which can sometimes return a positive number, only consider a
failure if the return is negative. Otherwise, it will skip possible
other registered probes and by returning a positive number that
wasn't read, it will confuse the user processes doing the writing.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
register_ftrace_function_probe() returns the number of functions
it registered, which can be zero, it can also return a negative number
if something went wrong. But event_enable_func() only checks for
the case that it didn't register anything, it needs to also check
for the case that something went wrong and return that error code
as well.
Added some comments about the code as well, to make it more
understandable.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return 0 instead of the number of activated ftrace function probes if
event_enable_func succeeded and return an error code if it failed or
did not register any functions. But it currently returns the number
of registered functions and if it didn't register anything, it returns 0,
but that is considered success.
This also fixes the return value. As if it succeeds, it returns the
number of functions that were enabled, which is returned back to
the user in ftrace_regex_write (the write() return code). If only
one function is enabled, then the return code of the write is one,
and this can confuse the user program in thinking it only wrote 1
byte.
Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Rewrote change log to reflect that this fixes two bugs - SR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block driver updates from Jens Axboe:
"It might look big in volume, but when categorized, not a lot of
drivers are touched. The pull request contains:
- mtip32xx fixes from Micron.
- A slew of drbd updates, this time in a nicer series.
- bcache, a flash/ssd caching framework from Kent.
- Fixes for cciss"
* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
bcache: Use bd_link_disk_holder()
bcache: Allocator cleanup/fixes
cciss: bug fix to prevent cciss from loading in kdump crash kernel
cciss: add cciss_allow_hpsa module parameter
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
mtip32xx: Workaround for unaligned writes
bcache: Make sure blocksize isn't smaller than device blocksize
bcache: Fix merge_bvec_fn usage for when it modifies the bvm
bcache: Correctly check against BIO_MAX_PAGES
bcache: Hack around stuff that clones up to bi_max_vecs
bcache: Set ra_pages based on backing device's ra_pages
bcache: Take data offset from the bdev superblock.
mtip32xx: mtip32xx: Disable TRIM support
mtip32xx: fix a smatch warning
bcache: Disable broken btree fuzz tester
bcache: Fix a format string overflow
bcache: Fix a minor memory leak on device teardown
bcache: Documentation updates
bcache: Use WARN_ONCE() instead of __WARN()
bcache: Add missing #include <linux/prefetch.h>
...
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
The helper function didn't include a leading space, so it was jammed
against the previous text in the audit record.
Signed-off-by: Eric Paris <eparis@redhat.com>
Merge more incoming from Andrew Morton:
- Various fixes which were stalled or which I picked up recently
- A large rotorooting of the AIO code. Allegedly to improve
performance but I don't really have good performance numbers (I might
have lost the email) and I can't raise Kent today. I held this out
of 3.9 and we could give it another cycle if it's all too late/scary.
I ended up taking only the first two thirds of the AIO rotorooting. I
left the percpu parts and the batch completion for later. - Linus
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
aio: don't include aio.h in sched.h
aio: kill ki_retry
aio: kill ki_key
aio: give shared kioctx fields their own cachelines
aio: kill struct aio_ring_info
aio: kill batch allocation
aio: change reqs_active to include unreaped completions
aio: use cancellation list lazily
aio: use flush_dcache_page()
aio: make aio_read_evt() more efficient, convert to hrtimers
wait: add wait_event_hrtimeout()
aio: refcounting cleanup
aio: make aio_put_req() lockless
aio: do fget() after aio_get_req()
aio: dprintk() -> pr_debug()
aio: move private stuff out of aio.h
aio: add kiocb_cancel()
aio: kill return value of aio_complete()
char: add aio_{read,write} to /dev/{null,zero}
aio: remove retry-based AIO
...
This reverts commit 6ff5e45985.
Conflicts:
kernel/audit.c
This patch was starting a kthread for all the time. Since the follow on
patches that required it didn't get finished in 3.10 time, we shouldn't
ship this change in 3.10.
Signed-off-by: Eric Paris <eparis@redhat.com>
audit rule additions containing "-F auid!=4294967295" were failing
with EINVAL because of a regression caused by e1760bd.
Apparently some userland audit rule sets want to know if loginuid uid
has been set and are using a test for auid != 4294967295 to determine
that.
In practice that is a horrible way to ask if a value has been set,
because it relies on subtle implementation details and will break
every time the uid implementation in the kernel changes.
So add a clean way to test if the audit loginuid has been set, and
silently convert the old idiom to the cleaner and more comprehensible
new idiom.
Cc: <stable@vger.kernel.org> # 3.7
Reported-By: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Add perf_event_aux() function to send out all types of
auxiliary events - mmap, task, comm events. For each type
there's match and output functions defined and used as
callbacks during perf_event_aux processing.
This way we can centralize the pmu/context iterating and
event matching logic. Also since lot of the code was
duplicated, this patch reduces the .text size about 2kB
on my setup:
snipped output from 'objdump -x kernel/events/core.o'
before:
Idx Name Size
0 .text 0000d313
after:
Idx Name Size
0 .text 0000cad3
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The perf_event_task_ctx() function needs to be called with
preemption disabled, since it's checking for currently
scheduled cpu against event cpu.
We disable preemption for task related perf event context
if there's one defined, leaving up to the chance which cpu
it gets scheduled in.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These inlines are only used by kernel/sched/fair.c so they do
not need to be present in the main kernel/sched/sched.h file.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1366398650-31599-3-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This large chunk of load calculation code can be easily divorced
from the main core.c scheduler file, with only a couple
prototypes and externs added to a kernel/sched header.
Some recent commits expanded the code and the documentation of
it, making it large enough to warrant separation. For example,
see:
556061b, "sched/nohz: Fix rq->cpu_load[] calculations"
5aaa0b7, "sched/nohz: Fix rq->cpu_load calculations some more"
5167e8d, "sched/nohz: Rewrite and fix load-avg computation -- again"
More importantly, it helps reduce the size of the main
sched/core.c by yet another significant amount (~600 lines).
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1366398650-31599-2-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some interrupt controllers refuse to map interrupts marked as
"protected" by firwmare. Since we try to map everyting in the
device-tree on some platforms, we end up with a lot of nasty
WARN's in the boot log for what is a normal situation on those
machines.
This defines a specific return code (-EPERM) from the host map()
callback which cause irqdomain to fail silently.
MPIC is updated to return this when hitting a protected source
printing only a single line message for diagnostic purposes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull 'full dynticks' support from Ingo Molnar:
"This tree from Frederic Weisbecker adds a new, (exciting! :-) core
kernel feature to the timer and scheduler subsystems: 'full dynticks',
or CONFIG_NO_HZ_FULL=y.
This feature extends the nohz variable-size timer tick feature from
idle to busy CPUs (running at most one task) as well, potentially
reducing the number of timer interrupts significantly.
This feature got motivated by real-time folks and the -rt tree, but
the general utility and motivation of full-dynticks runs wider than
that:
- HPC workloads get faster: CPUs running a single task should be able
to utilize a maximum amount of CPU power. A periodic timer tick at
HZ=1000 can cause a constant overhead of up to 1.0%. This feature
removes that overhead - and speeds up the system by 0.5%-1.0% on
typical distro configs even on modern systems.
- Real-time workload latency reduction: CPUs running critical tasks
should experience as little jitter as possible. The last remaining
source of kernel-related jitter was the periodic timer tick.
- A single task executing on a CPU is a pretty common situation,
especially with an increasing number of cores/CPUs, so this feature
helps desktop and mobile workloads as well.
The cost of the feature is mainly related to increased timer
reprogramming overhead when a CPU switches its tick period, and thus
slightly longer to-idle and from-idle latency.
Configuration-wise a third mode of operation is added to the existing
two NOHZ kconfig modes:
- CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
as a config option. This is the traditional Linux periodic tick
design: there's a HZ tick going on all the time, regardless of
whether a CPU is idle or not.
- CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
periodic tick when a CPU enters idle mode.
- CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
tick when a CPU is idle, also slows the tick down to 1 Hz (one
timer interrupt per second) when only a single task is running on a
CPU.
The .config behavior is compatible: existing !CONFIG_NO_HZ and
CONFIG_NO_HZ=y settings get translated to the new values, without the
user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
default.
This feature is based on a lot of infrastructure work that has been
steadily going upstream in the last 2-3 cycles: related RCU support
and non-periodic cputime support in particular is upstream already.
This tree adds the final pieces and activates the feature. The pull
request is marked RFC because:
- it's marked 64-bit only at the moment - the 32-bit support patch is
small but did not get ready in time.
- it has a number of fresh commits that came in after the merge
window. The overwhelming majority of commits are from before the
merge window, but still some aspects of the tree are fresh and so I
marked it RFC.
- it's a pretty wide-reaching feature with lots of effects - and
while the components have been in testing for some time, the full
combination is still not very widely used. That it's default-off
should reduce its regression abilities and obviously there are no
known regressions with CONFIG_NO_HZ_FULL=y enabled either.
- the feature is not completely idempotent: there is no 100%
equivalent replacement for a periodic scheduler/timer tick. In
particular there's ongoing work to map out and reduce its effects
on scheduler load-balancing and statistics. This should not impact
correctness though, there are no known regressions related to this
feature at this point.
- it's a pretty ambitious feature that with time will likely be
enabled by most Linux distros, and we'd like you to make input on
its design/implementation, if you dislike some aspect we missed.
Without flaming us to crisp! :-)
Future plans:
- there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
the periodic tick altogether when there's a single busy task on a
CPU. We'd first like 1 Hz to be exposed more widely before we go
for the 0 Hz target though.
- once we reach 0 Hz we can remove the periodic tick assumption from
nr_running>=2 as well, by essentially interrupting busy tasks only
as frequently as the sched_latency constraints require us to do -
once every 4-40 msecs, depending on nr_running.
I am personally leaning towards biting the bullet and doing this in
v3.10, like the -rt tree this effort has been going on for too long -
but the final word is up to you as usual.
More technical details can be found in Documentation/timers/NO_HZ.txt"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
sched: Keep at least 1 tick per second for active dynticks tasks
rcu: Fix full dynticks' dependency on wide RCU nocb mode
nohz: Protect smp_processor_id() in tick_nohz_task_switch()
nohz_full: Add documentation.
cputime_nsecs: use math64.h for nsec resolution conversion helpers
nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
nohz: Reduce overhead under high-freq idling patterns
nohz: Remove full dynticks' superfluous dependency on RCU tree
nohz: Fix unavailable tick_stop tracepoint in dynticks idle
nohz: Add basic tracing
nohz: Select wide RCU nocb for full dynticks
nohz: Disable the tick when irq resume in full dynticks CPU
nohz: Re-evaluate the tick for the new task after a context switch
nohz: Prepare to stop the tick on irq exit
nohz: Implement full dynticks kick
nohz: Re-evaluate the tick from the scheduler IPI
sched: New helper to prevent from stopping the tick in full dynticks
sched: Kick full dynticks CPU that have more than one task enqueued.
perf: New helper to prevent full dynticks CPUs from stopping tick
perf: Kick full dynticks CPU if events rotation is needed
...
Pull perf fixes from Ingo Molnar:
"Misc fixes plus a small hw-enablement patch for Intel IB model 58
uncore events"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
perf/x86/intel/lbr: Fix LBR filter
perf/x86: Blacklist all MEM_*_RETIRED events for Ivy Bridge
perf: Fix vmalloc ring buffer pages handling
perf/x86/intel: Fix unintended variable name reuse
perf/x86/intel: Add support for IvyBridge model 58 Uncore
perf/x86/intel: Fix typo in perf_event_intel_uncore.c
x86: Eliminate irq_mis_count counted in arch_irq_stat
fix a remove/insert race which Never Happens, and (my favorite) handle the
case when we have too many modules for a single commandline. Seriously,
the kernel is full, please go away!
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRgbGEAAoJENkgDmzRrbjx2+QP/jXs93K/sXw3rL0vBklwCFv6
IPZmqYZiGjrzqlB4coWkgYRwW1oOsREfAjF5MmfPdykS3fO5kXfdxN4FBdfKp+IZ
RdsycdGDuSxWomgYsivrrxLBDxDAX1VuBOjr6mu5Uuk/pCjFa61cfJDiErsu0jKz
2EMTc98A+E71XamJdvbtal5MUIu9yeluJWG2ux2+VbCul4MSpMc//0n2nrws/RCB
AoC96AT/Xf4U10a8zT8RfCJ29M5Vvx/KfTIcFiZvtCQxEaHNNmj831gDNiw/3jFI
ndRph+VLHBsMoBMxfzNRrM+evqkq8+AGEGRj3ycQy5Pa6DunPyzMafWOVGBGnmaS
tl9hATGx1438048i5tUn8ieAYG1YL1HM83hQovpCThfUKQMiq186iDt1SYYmlq3g
0thj3znQqZDYhboPtgWzOMUdqOG/iBIKjhGQjjHZs+MInFgxL2hmax0gBNkvEtQb
oLyfGbF6UjS7I/Md/HohnUQ4xr9kYa3MQeqPjKbRwgHRkdXhzTEZtI+MYDJBxOnW
QGVQ97aJ2WA7vC7sz/1VhTcZqmU5zfrSc8lF+Ea+H8dQGHHbz8HxKQacEvKcMrXl
OJyEkRUWDA0MTjeIHzn2fff9Q6/qqA1QejRiFofGJrpxopcJS84/7yA0repxvuMG
yaMPsLq53UW37/AXYsho
=MPiD
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull mudule updates from Rusty Russell:
"We get rid of the general module prefix confusion with a binary config
option, fix a remove/insert race which Never Happens, and (my
favorite) handle the case when we have too many modules for a single
commandline. Seriously, the kernel is full, please go away!"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
modpost: fix unwanted VMLINUX_SYMBOL_STR expansion
X.509: Support parse long form of length octets in Authority Key Identifier
module: don't unlink the module until we've removed all exposure.
kernel: kallsyms: memory override issue, need check destination buffer length
MODSIGN: do not send garbage to stderr when enabling modules signature
modpost: handle huge numbers of modules.
modpost: add -T option to read module names from file/stdin.
modpost: minor cleanup.
genksyms: pass symbol-prefix instead of arch
module: fix symbol versioning with symbol prefixes
CONFIG_SYMBOL_PREFIX: cleanup.
commit b352bc1cbc (tick: Convert broadcast cpu bitmaps to
cpumask_var_t) broke CONFIG_CPUMASK_OFFSTACK in a very subtle way.
Instead of allocating the cpumasks with zalloc_cpumask_var it uses
alloc_cpumask_var, so we can get random data there, which of course
confuses the logic completely and causes random failures.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305032015060.2990@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull second round of VFS updates from Al Viro:
"Assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xtensa simdisk: fix braino in "xtensa simdisk: switch to proc_create_data()"
hostfs: use kmalloc instead of kzalloc
hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h>
hostfs: remove "will unlock" comment
vfs: use list_move instead of list_del/list_add
proc_devtree: Replace include linux/module.h with linux/export.h
create_mnt_ns: unidiomatic use of list_add()
fs: remove dentry_lru_prune()
Removed unused typedef to avoid "unused local typedef" warnings.
kill fs/read_write.h
fs: Fix hang with BSD accounting on frozen filesystem
sun3_scsi: add ->show_info()
nubus: Kill nubus_proc_detach_device()
more mode_t whack-a-mole...
do_coredump(): don't wait for thaw if coredump has already been interrupted
do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks")
When BSD process accounting is enabled and logs information to a
filesystem which gets frozen, system easily becomes unusable because
each attempt to account process information blocks. Thus e.g. every task
gets blocked in exit.
It seems better to drop accounting information (which can already happen
when filesystem is running out of space) instead of locking system up.
So we just skip the write if the filesystem is frozen.
Reported-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The scheduler doesn't yet fully support environments
with a single task running without a periodic tick.
In order to ensure we still maintain the duties of scheduler_tick(),
keep at least 1 tick per second.
This makes sure that we keep the progression of various scheduler
accounting and background maintainance even with a very low granularity.
Examples include cpu load, sched average, CFS entity vruntime,
avenrun and events such as load balancing, amongst other details
handled in sched_class::task_tick().
This limitation will be removed in the future once we get
these individual items to work in full dynticks CPUs.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Commit 0637e02939
("nohz: Select wide RCU nocb for full dynticks") intended
to force CONFIG_RCU_NOCB_CPU_ALL=y when full dynticks is
enabled.
However this option is part of a choice menu and Kconfig's
"select" instruction has no effect on such targets.
Fix this by using reverse dependencies on the targets we
don't want instead.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
As the wake up logic for waiters on the buffer has been moved
from the tracing code to the ring buffer, it requires also adding
IRQ_WORK as the wake up code is performed via irq_work.
This fixes compile breakage when a user of the ring buffer is selected
but tracing and irq_work are not.
Link http://lkml.kernel.org/r/20130503115332.GT8356@rric.localhost
Cc: Arnd Bergmann <arnd@arndb.de>
Reported-by: Robert Richter <rric@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull drm updates from Dave Airlie:
"This is the main drm pull request for 3.10.
Wierd bits:
- OMAP drm changes required OMAP dss changes, in drivers/video, so I
took them in here.
- one more fbcon fix for font handover
- VT switch avoidance in pm code
- scatterlist helpers for gpu drivers - have acks from akpm
Highlights:
- qxl kms driver - driver for the spice qxl virtual GPU
Nouveau:
- fermi/kepler VRAM compression
- GK110/nvf0 modesetting support.
Tegra:
- host1x core merged with 2D engine support
i915:
- vt switchless resume
- more valleyview support
- vblank fixes
- modesetting pipe config rework
radeon:
- UVD engine support
- SI chip tiling support
- GPU registers initialisation from golden values.
exynos:
- device tree changes
- fimc block support
Otherwise:
- bunches of fixes all over the place."
* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (513 commits)
qxl: update to new idr interfaces.
drm/nouveau: fix build with nv50->nvc0
drm/radeon: fix handling of v6 power tables
drm/radeon: clarify family checks in pm table parsing
drm/radeon: consolidate UVD clock programming
drm/radeon: fix UPLL_REF_DIV_MASK definition
radeon: add bo tracking debugfs
drm/radeon: add new richland pci ids
drm/radeon: add some new SI PCI ids
drm/radeon: fix scratch reg handling for UVD fence
drm/radeon: allocate SA bo in the requested domain
drm/radeon: fix possible segfault when parsing pm tables
drm/radeon: fix endian bugs in atom_allocate_fb_scratch()
OMAPDSS: TFP410: return EPROBE_DEFER if the i2c adapter not found
OMAPDSS: VENC: Add error handling for venc_probe_pdata
OMAPDSS: HDMI: Add error handling for hdmi_probe_pdata
OMAPDSS: RFBI: Add error handling for rfbi_probe_pdata
OMAPDSS: DSI: Add error handling for dsi_probe_pdata
OMAPDSS: SDI: Add error handling for sdi_probe_pdata
OMAPDSS: DPI: Add error handling for dpi_probe_pdata
...
Pull scheduler fixes from Ingo Molnar:
"This fixes the cputime scaling overflow problems for good without
having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
as well."
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "math64: New div64_u64_rem helper"
sched: Avoid prev->stime underflow
sched: Do not account bogus utime
sched: Avoid cputime scaling overflow
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.
Merge a common upstream merge point that has these
updates.
Conflicts:
include/linux/perf_event.h
kernel/rcutree.h
kernel/rcutree_plugin.h
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
sort):
1) Add 'random' mode to team driver, from Jiri Pirko and Eric
Dumazet.
2) Make it so that any driver that supports configuration of multiple
MAC addresses can provide the forwarding database add and del
calls by providing a default implementation and hooking that up if
the driver doesn't have an explicit set of handlers. From Vlad
Yasevich.
3) Support GSO segmentation over tunnels and other encapsulating
devices such as VXLAN, from Pravin B Shelar.
4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
Dukkipati.
6) In the PHY layer, allow supporting wake-on-lan in situations where
the PHY registers have to be written for it to be configured.
Use it to support wake-on-lan in mv643xx_eth.
From Michael Stapelberg.
7) Significantly improve firewire IPV6 support, from YOSHIFUJI
Hideaki.
8) Allow multiple packets to be sent in a single transmission using
network coding in batman-adv, from Martin Hundebøll.
9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
10) Generalize the VXLAN forwarding tables so that there is more
flexibility in configurating various aspects of the endpoints.
From David Stevens.
11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
from Dmitry Kravkov.
12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
Neira Ayuso.
13) Start adding networking selftests.
14) In situations of overload on the same AF_PACKET fanout socket, or
per-cpu packet receive queue, minimize drop by distributing the
load to other cpus/fanouts. From Willem de Bruijn and Eric
Dumazet.
15) Add support for new payload offset BPF instruction, from Daniel
Borkmann.
16) Convert several drivers over to mdoule_platform_driver(), from
Sachin Kamat.
17) Provide a minimal BPF JIT image disassembler userspace tool, from
Daniel Borkmann.
18) Rewrite F-RTO implementation in TCP to match the final
specification of it in RFC4138 and RFC5682. From Yuchung Cheng.
19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
you like netlink, so I implemented netlink dumping of netlink
sockets.") From Andrey Vagin.
20) Remove ugly passing of rtnetlink attributes into rtnl_doit
functions, from Thomas Graf.
21) Allow userspace to be able to see if a configuration change occurs
in the middle of an address or device list dump, from Nicolas
Dichtel.
22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
Frederic Sowa.
23) Increase accuracy of packet length used by packet scheduler, from
Jason Wang.
24) Beginning set of changes to make ipv4/ipv6 fragment handling more
scalable and less susceptible to overload and locking contention,
from Jesper Dangaard Brouer.
25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
instead. From Hong Zhiguo.
26) Optimize route usage in IPVS by avoiding reference counting where
possible, from Julian Anastasov.
27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
Eitzenberger.
29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
nfnetlink_log, and nfnetlink_queue. From Gao feng.
30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
31) Support several new r8169 chips, from Hayes Wang.
32) Support tokenized interface identifiers in ipv6, from Daniel
Borkmann.
33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
34) Add 802.1ad vlan offload support, from Patrick McHardy.
35) Support mmap() based netlink communication, also from Patrick
McHardy.
36) Support HW timestamping in mlx4 driver, from Amir Vadai.
37) Rationalize AF_PACKET packet timestamping when transmitting, from
Willem de Bruijn and Daniel Borkmann.
38) Bring parity to what's provided by /proc/net/packet socket dumping
and the info provided by netlink socket dumping of AF_PACKET
sockets. From Nicolas Dichtel.
39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
Poirier"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
filter: fix va_list build error
af_unix: fix a fatal race with bit fields
bnx2x: Prevent memory leak when cnic is absent
bnx2x: correct reading of speed capabilities
net: sctp: attribute printl with __printf for gcc fmt checks
netlink: kconfig: move mmap i/o into netlink kconfig
netpoll: convert mutex into a semaphore
netlink: Fix skb ref counting.
net_sched: act_ipt forward compat with xtables
mlx4_en: fix a build error on 32bit arches
Revert "bnx2x: allow nvram test to run when device is down"
bridge: avoid OOPS if root port not found
drivers: net: cpsw: fix kernel warn on cpsw irq enable
sh_eth: use random MAC address if no valid one supplied
3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
tg3: fix to append hardware time stamping flags
unix/stream: fix peeking with an offset larger than data in queue
unix/dgram: fix peeking with an offset larger than data in queue
unix/dgram: peek beyond 0-sized skbs
openvswitch: Remove unneeded ovs_netdev_get_ifindex()
...
Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions
If we allocate perf ring buffer with the size of single (user)
page, we will get memory corruption when releasing itin
rb_free_work function (for CONFIG_PERF_USE_VMALLOC option).
For single page sized ring buffer the page_order is -1 (because
nr_pages is 0). This needs to be recognized in the rb_free_work
function to release proper amount of pages.
Adding data_page_nr function that returns number of allocated
data pages. Customizing the rest of the code to use it.
Reported-by: Jan Stancek <jstancek@redhat.com>
Original-patch-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20130319143509.GA1128@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge third batch of fixes from Andrew Morton:
"Most of the rest. I still have two large patchsets against AIO and
IPC, but they're a bit stuck behind other trees and I'm about to
vanish for six days.
- random fixlets
- inotify
- more of the MM queue
- show_stack() cleanups
- DMI update
- kthread/workqueue things
- compat cleanups
- epoll udpates
- binfmt updates
- nilfs2
- hfs
- hfsplus
- ptrace
- kmod
- coredump
- kexec
- rbtree
- pids
- pidns
- pps
- semaphore tweaks
- some w1 patches
- relay updates
- core Kconfig changes
- sysrq tweaks"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
Documentation/sysrq: fix inconstistent help message of sysrq key
ethernet/emac/sysrq: fix inconstistent help message of sysrq key
sparc/sysrq: fix inconstistent help message of sysrq key
powerpc/xmon/sysrq: fix inconstistent help message of sysrq key
ARM/etm/sysrq: fix inconstistent help message of sysrq key
power/sysrq: fix inconstistent help message of sysrq key
kgdb/sysrq: fix inconstistent help message of sysrq key
lib/decompress.c: fix initconst
notifier-error-inject: fix module names in Kconfig
kernel/sys.c: make prctl(PR_SET_MM) generally available
UAPI: remove empty Kbuild files
menuconfig: print more info for symbol without prompts
init/Kconfig: re-order CONFIG_EXPERT options to fix menuconfig display
kconfig menu: move Virtualization drivers near other virtualization options
Kconfig: consolidate CONFIG_DEBUG_STRICT_USER_COPY_CHECKS
relay: use macro PAGE_ALIGN instead of FIX_SIZE
kernel/relay.c: move FIX_SIZE macro into relay.c
kernel/relay.c: remove unused function argument actor
drivers/w1/slaves/w1_ds2760.c: fix the error handling in w1_ds2760_add_slave()
drivers/w1/slaves/w1_ds2781.c: fix the error handling in w1_ds2781_add_slave()
...
Currently help message of /proc/sysrq-trigger highlight its
upper-case characters, like below:
SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
memory-full-oom-kill(F) kill-all-tasks(I) ...
this would confuse user trigger sysrq by upper-case character, which is
inconsistent with the real lower-case character registed key.
This inconsistent help message will also lead more confused when
26 upper-case letters put into use in future.
This patch fix power off sysrq key: "poweroff(o)"
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently help message of /proc/sysrq-trigger highlight its upper-case
characters, like below:
SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)
memory-full-oom-kill(F) kill-all-tasks(I) ...
this would confuse user trigger sysrq by upper-case character, which is
inconsistent with the real lower-case character registed key.
This inconsistent help message will also lead more confused when
26 upper-case letters put into use in future.
This patch fix kgdb sysrq key: "debug(g)"
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:
start_code, end_code, start_data, end_data, start_brk, brk,
start_stack, arg_start, arg_end, env_start, env_end.
This functionality is needed by any application or package that needs to
reconstruct Linux processes, that is, to start them in any way other than
by means of an "execve()" from an executable file. This includes:
1. Restoring processes from a checkpoint-file (by all potential
user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
and high-availablity).
4. Starting a process from an executable format that is not supported
by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
executable that, for confidentiality, licensing or other reasons,
may not be written to the local file-systems.
The code that does that was already included in the Linux kernel by the
CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
normally disabled. The patch removes those ifdefs.
Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Macro FIX_SIZE is same as PAGE_ALIGN at present, so use PAGE_ALIGN
instead.
Thanks Andrew found this.
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's better to place FIX_SIZE macro in relay.c, instead of relay.h
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently argument `actor' is never used in the relay reading path, so
remove it.
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.
[akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
Signed-off-by: Raphael S.Carvalho <raphael.scarv@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_next_offset() searches for an available "cleaned bit" in the
respective pid bitmap (page), so returns the offset if found, otherwise
it returns a value equals to BITS_PER_PAGE.
For example, suppose find_next_offset didn't find any available bit, so
there's no purpose to call mk_pid (Wasteful Cpu Cycles).
Therefore, I found it could be better to call mk_pid after the checking
(offset < BITS_PER_PAGE) returned sucessfully! Another point: If (offset
< BITS_PER_PAGE) results in a "failure", then mk_pid would be called
again afterwards.
[akpm@linux-foundation.org: simplify code]
Signed-off-by: Raphael S. Carvalho <raphael.scarv@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The types of the following local variables:
- ubytes/mbytes in kimage_load_crash_segment()/kimage_load_normal_segment()
- r in vmcoreinfo_append_str()
are wrong, so fix them.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are 2 well known and ancient problems with coredump/signals, and a
lot of related bug reports:
- do_coredump() clears TIF_SIGPENDING but of course this can't help
if, say, SIGCHLD comes after that.
In this case the coredump can fail unexpectedly. See for example
wait_for_dump_helper()->signal_pending() check but there are other
reasons.
- At the same time, dumping a huge core on the slow media can take a
lot of time/resources and there is no way to kill the coredumping
task reliably. In particular this is not oom_kill-friendly.
This patch tries to fix the 1st problem, and makes the preparation for the
next changes.
We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
that this process dumps the core. prepare_signal() checks this flag and
nacks any signal except SIGKILL.
Note that this check tries to be conservative, in the long term we should
probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
discussion. See marc.info/?l=linux-kernel&m=120508897917439
Notes:
- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
The patch assumes that dump_write/etc paths should never
call it, but we can change it as well.
- There is another source of TIF_SIGPENDING, freezer. This
will be addressed separately.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function suffers from not being able to determine if the cleanup is
called in case it returns -ENOMEM. Nobody is using it anymore, so let's
remove it.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns(). In case the latter returns -ENOMEM the
cleanup function may had not been called - in this case we would not free
argv and module_name.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
call_usermodehelper_setup() + call_usermodehelper_exec() need to be
called instead of call_usermodehelper_fns() when the cleanup function
needs to be called even when an ENOMEM error occurs. In this case using
call_usermodehelper_fns() the user can't distinguish if the cleanup
function was called or not.
[akpm@linux-foundation.org: export call_usermodehelper_setup() to modules]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new ptrace request PTRACE_PEEKSIGINFO.
This request is used to retrieve information about pending signals
starting with the specified sequence number. Siginfo_t structures are
copied from the child into the buffer starting at "data".
The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
struct ptrace_peeksiginfo_args {
u64 off; /* from which siginfo to start */
u32 flags;
s32 nr; /* how may siginfos to take */
};
"nr" has type "s32", because ptrace() returns "long", which has 32 bits on
i386 and a negative values is used for errors.
Currently here is only one flag PTRACE_PEEKSIGINFO_SHARED for dumping
signals from process-wide queue. If this flag is not set, signals are
read from a per-thread queue.
The request PTRACE_PEEKSIGINFO returns a number of dumped signals. If a
signal with the specified sequence number doesn't exist, ptrace returns
zero. The request returns an error, if no signal has been dumped.
Errors:
EINVAL - one or more specified flags are not supported or nr is negative
EFAULT - buf or addr is outside your accessible address space.
A result siginfo contains a kernel part of si_code which usually striped,
but it's required for queuing the same siginfo back during restore of
pending signals.
This functionality is required for checkpointing pending signals. Pedro
Alves suggested using it in "gdb" to peek at pending signals. gdb already
uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
dequeued. This functionality allows gdb to look at the pending signals
which were not reported yet.
The prototype of this code was developed by Oleg Nesterov.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pedro Alves <palves@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton noted:
akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
SYSCALL_DEFINE1(alarm, unsigned int, seconds)
SYSCALL_DEFINE0(getpid)
SYSCALL_DEFINE0(getppid)
SYSCALL_DEFINE0(getuid)
SYSCALL_DEFINE0(geteuid)
SYSCALL_DEFINE0(getgid)
SYSCALL_DEFINE0(getegid)
SYSCALL_DEFINE0(gettid)
SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)
Only one of those should be in kernel/timer.c. Who wrote this thing?
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only use outside of kernel/timer.c was in kernel/compat.c, so move
compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We sometimes use "struct call_single_data *data" and sometimes "struct
call_single_data *csd". Use "csd" consistently.
We sometimes use "struct call_function_data *data" and sometimes "struct
call_function_data *cfd". Use "cfd" consistently.
Also, avoid some 80-col layout tricks.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'priv' field is redundant; we can pass data via 'info'.
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
csd_lock() uses assignment to data->flags rather than |=. That is not
buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
call_single_data.flags.
But it will become buggy if we later add another flag, so fix it now.
Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the problems that arise when converting dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonimizes each worker making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug
dump from oops, BUG() and friends.
This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description. When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.
The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info(). print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields. It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.
The description is currently limited to 24bytes including the
terminating \0. worker->desc_valid and workder->desc[] are added and
the 64 bytes marker which was already incorrect before adding the new
fields is moved to the correct position.
Here's an example dump with writeback updated to set the bdi name as
worker desc.
Hardware name: Bochs
Modules linked in:
Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
Workqueue: writeback bdi_writeback_workfn (flush-8:0)
ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
Call Trace:
[<ffffffff81c61845>] dump_stack+0x19/0x1b
[<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
...
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the problems that arise when converting dedicated custom threadpool
to workqueue is that the shared worker pool used by workqueue anonimizes
each worker making it more difficult to identify what the worker was doing
on which target from the output of sysrq-t or debug dump from oops, BUG()
and friends.
For example, after writeback is converted to use workqueue instead of
priviate thread pool, there's no easy to tell which backing device a
writeback work item was working on at the time of task dump, which,
according to our writeback brethren, is important in tracking down issues
with a lot of mounted file systems on a lot of different devices.
This patchset implements a way for a work function to mark its execution
instance so that task dump of the worker task includes information to
indicate what the work item was doing.
An example WARN dump would look like the following.
WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
Modules linked in:
CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
Workqueue: writeback bdi_writeback_workfn (flush-8:16)
ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
Call Trace:
[<ffffffff81c61855>] dump_stack+0x19/0x1b
[<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
...
This patch:
Implement probe_kthread_data() which returns kthread_data if accessible.
The function is equivalent to kthread_data() except that the specified
@task may not be a kthread or its vfork_done is already cleared rendering
struct kthread inaccessible. In the former case, probe_kthread_data() may
return any value. In the latter, NULL.
This will be used to safely print debug information without affecting
synchronization in the normal paths. Workqueue debug info printing on
dump_stack() and friends will make use of it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms. This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.
show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.
* Archs which didn't print debug info now do.
alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
um, xtensa
* Already prints debug info. Replaced with show_regs_print_info().
The printed information is superset of what used to be there.
arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
* s390 is special in that it used to print arch-specific information
along with generic debug info. Heiko and Martin think that the
arch-specific extra isn't worth keeping s390 specfic implementation.
Converted to use the generic version.
Note that now all archs print the debug info before actual register
dumps.
An example BUG() dump follows.
kernel BUG at /work/os/work/kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>] [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8 EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
[<ffffffff81000312>] do_one_initcall+0x122/0x170
[<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
[<ffffffff81c47760>] ? rest_init+0x140/0x140
[<ffffffff81c4776e>] kernel_init+0xe/0xf0
[<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
[<ffffffff81c47760>] ? rest_init+0x140/0x140
...
v2: Typo fix in x86-32.
v3: CPU number dropped from show_regs_print_info() as
dump_stack_print_info() has been updated to print it. s390
specific implementation dropped as requested by s390 maintainers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86 and ia64 can acquire extra hardware identification information
from DMI and print it along with task dumps; however, the usage isn't
consistent.
* x86 show_regs() collects vendor, product and board strings and print
them out with PID, comm and utsname. Some of the information is
printed again later in the same dump.
* warn_slowpath_common() explicitly accesses the DMI board and prints
it out with "Hardware name:" label. This applies to both x86 and
ia64 but is irrelevant on all other archs.
* ia64 doesn't show DMI information on other non-WARN dumps.
This patch introduces arch-specific hardware description used by
dump_stack(). It can be set by calling dump_stack_set_arch_desc()
during boot and, if exists, printed out in a separate line with
"Hardware name:" label.
dmi_set_dump_stack_arch_desc() is added which sets arch-specific
description from DMI data. It uses dmi_ids_string[] which is set from
dmi_present() used for DMI debug message. It is superset of the
information x86 show_regs() is using. The function is called from x86
and ia64 boot code right after dmi_scan_machine().
This makes the explicit DMI handling in warn_slowpath_common()
unnecessary. Removed.
show_regs() isn't yet converted to use generic debug information
printing and this patch doesn't remove the duplicate DMI handling in
x86 show_regs(). The next patch will unify show_regs() handling and
remove the duplication.
An example WARN dump follows.
WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
[<ffffffff81c614dc>] dump_stack+0x19/0x1b
[<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
[<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8234a0c3>] init_workqueues+0x35/0x505
...
v2: Use the same string as the debug message from dmi_present() which
also contains BIOS information. Move hardware name into its own
line as warn_slowpath_common() did. This change was suggested by
Bjorn Helgaas.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both dump_stack() and show_stack() are currently implemented by each
architecture. show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack(). On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.
The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().
There's no reason to require archs to implement two separate but mostly
identical functions. It leads to unnecessary subtle information.
This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin. Blackfin's
dump_stack() does something wonky that I don't understand.
Debug information can be printed separately by calling
dump_stack_print_info() so that arch-specific dump_stack()
implementation can still emit the same debug information. This is used
in blackfin.
This patch brings the following behavior changes.
* On some archs, an extra level in backtrace for show_stack() could be
printed. This is because the top frame was determined in
dump_stack() on those archs while generic dump_stack() can't do that
reliably. It can be compensated by inlining dump_stack() but not
sure whether that'd be necessary.
* Most archs didn't use to print debug info on dump_stack(). They do
now.
An example WARN dump follows.
WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Hardware name: empty
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
[<ffffffff81c614dc>] dump_stack+0x19/0x1b
[<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8234a071>] init_workqueues+0x35/0x505
...
v2: CPU number added to the generic debug info as requested by s390
folks and dropped the s390 specific dump_stack(). This loses %ksp
from the debug message which the maintainers think isn't important
enough to keep the s390-specific dump_stack() implementation.
dump_stack_print_info() is moved to kernel/printk.c from
lib/dump_stack.c. Because linkage is per objecct file,
dump_stack_print_info() living in the same lib file as generic
dump_stack() means that archs which implement custom dump_stack()
- at this point, only blackfin - can't use dump_stack_print_info()
as that will bring in the generic version of dump_stack() too. v1
The v1 patch broke build on blackfin due to this issue. The build
breakage was reported by Fengguang Wu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390 bits]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also replace deprecated printk(KERN_ERR...) with pr_err() as suggested
by Yinghai, attaching the function name to provide plenty info.
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull security subsystem update from James Morris:
"Just some minor updates across the subsystem"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ima: eliminate passing d_name.name to process_measurement()
TPM: Retry SaveState command in suspend path
tpm/tpm_i2c_infineon: Add small comment about return value of __i2c_transfer
tpm/tpm_i2c_infineon.c: Add OF attributes type and name to the of_device_id table entries
tpm_i2c_stm_st33: Remove duplicate inclusion of header files
tpm: Add support for new Infineon I2C TPM (SLB 9645 TT 1.2 I2C)
char/tpm: Convert struct i2c_msg initialization to C99 format
drivers/char/tpm/tpm_ppi: use strlcpy instead of strncpy
tpm/tpm_i2c_stm_st33: formatting and white space changes
Smack: include magic.h in smackfs.c
selinux: make security_sb_clone_mnt_opts return an error on context mismatch
seccomp: allow BPF_XOR based ALU instructions.
Fix NULL pointer dereference in smack_inode_unlink() and smack_inode_rmdir()
Smack: add support for modification of existing rules
smack: SMACK_MAGIC to include/uapi/linux/magic.h
Smack: add missing support for transmute bit in smack_str_from_perm()
Smack: prevent revoke-subject from failing when unseen label is written to it
tomoyo: use DEFINE_SRCU() to define tomoyo_ss
tomoyo: use DEFINE_SRCU() to define tomoyo_ss
- ARM big.LITTLE cpufreq driver from Viresh Kumar.
- exynos5440 cpufreq driver from Amit Daniel Kachhap.
- cpufreq core cleanup and code consolidation from Viresh Kumar and
Stratos Karafotis.
- cpufreq scalability improvement from Nathan Zimmer.
- AMD "frequency sensitivity feedback" powersave bias for the ondemand
cpufreq governor from Jacob Shin.
- cpuidle code consolidation and cleanups from Daniel Lezcano.
- ARM OMAP cpuidle fixes from Santosh Shilimkar and Daniel Lezcano.
- ACPICA fixes and other improvements from Bob Moore, Jung-uk Kim,
Lv Zheng, Yinghai Lu, Tang Chen, Colin Ian King, and Linn Crosetto.
- ACPI core updates related to hotplug from Toshi Kani, Paul Bolle,
Yasuaki Ishimatsu, and Rafael J. Wysocki.
- Intel Lynxpoint LPSS (Low-Power Subsystem) support improvements
from Rafael J. Wysocki and Andy Shevchenko.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJRf8M8AAoJEKhOf7ml8uNsud4P/3cabXP5lDipzibRrpOiONse
puuvIdhtNdMRMc3t1oSDjNH/w/JA51Gc+ICGFAORiyVmqxBc85mpT6J5ibqV7hNd
pCqbKJceoB5PajHZSx22e4wG9O7YN1k3r80p38IfFzA+Ct0KNSuE0ixMEfHKYjiq
p5pXswk6TY3gtBReH9agrafHqDtXw4IMTE0asMuJ+BorPW7vQeiNlrkuA+0qmDuu
26O0Pm2TVkx1ryfTjdM9zSZ9X2G4JuM8rm1/VFZWQJTExwlv3bA2Za1nvQNJlJ99
6JZ0JXfAehcEW2Ye0sqsZ8HSEabDVHM29QvvOszJ5RpBXERiOCHOkhvFleCoTpn0
Xq0rtXPrLMH1G28Ej+cxmsAjfzOLV2Byg30CAoI/GCLuQ+xh+VMCpuNYQuld25CG
9rtYd0fWESeYsAebhDcX0E3xyzJtbrHtOb9PyGwNkbAJ8YQfhVSMCOPi2SX2wa+Q
qXLXi2VaHvjBSUKcAv5BmM+Ya57Be+88D0LxbgXbUeOnYefUK1ljldKDDshkMjgG
P4LPdm4JpoB5ncXSOO1Dz9w9QnNcFexSUySd/TtKLNMha1vEHV8ISzNPYY+9IdXf
XN0VZbFnUDzdj+Fwna7zyFb1cGihDYJKAtpXvRd8Y6RGUxKx9uGLAFJZw/xZB/cR
KZKuML5O8MgJuef37F38
=H/se
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael J Wysocki:
- ARM big.LITTLE cpufreq driver from Viresh Kumar.
- exynos5440 cpufreq driver from Amit Daniel Kachhap.
- cpufreq core cleanup and code consolidation from Viresh Kumar and
Stratos Karafotis.
- cpufreq scalability improvement from Nathan Zimmer.
- AMD "frequency sensitivity feedback" powersave bias for the ondemand
cpufreq governor from Jacob Shin.
- cpuidle code consolidation and cleanups from Daniel Lezcano.
- ARM OMAP cpuidle fixes from Santosh Shilimkar and Daniel Lezcano.
- ACPICA fixes and other improvements from Bob Moore, Jung-uk Kim, Lv
Zheng, Yinghai Lu, Tang Chen, Colin Ian King, and Linn Crosetto.
- ACPI core updates related to hotplug from Toshi Kani, Paul Bolle,
Yasuaki Ishimatsu, and Rafael J Wysocki.
- Intel Lynxpoint LPSS (Low-Power Subsystem) support improvements from
Rafael J Wysocki and Andy Shevchenko.
* tag 'pm+acpi-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (192 commits)
cpufreq: Revert incorrect commit 5800043
cpufreq: MAINTAINERS: Add co-maintainer
cpuidle: add maintainer entry
ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points
ARM: s3c64xx: cpuidle: use init/exit common routine
cpufreq: pxa2xx: initialize variables
ACPI: video: correct acpi_video_bus_add error processing
SH: cpuidle: use init/exit common routine
ARM: S5pv210: compiling issue, ARM_S5PV210_CPUFREQ needs CONFIG_CPU_FREQ_TABLE=y
ACPI: Fix wrong parameter passed to memblock_reserve
cpuidle: fix comment format
pnp: use %*phC to dump small buffers
isapnp: remove debug leftovers
ARM: imx: cpuidle: use init/exit common routine
ARM: davinci: cpuidle: use init/exit common routine
ARM: kirkwood: cpuidle: use init/exit common routine
ARM: calxeda: cpuidle: use init/exit common routine
ARM: tegra: cpuidle: use init/exit common routine for tegra3
ARM: tegra: cpuidle: use init/exit common routine for tegra2
ARM: OMAP4: cpuidle: use init/exit common routine
...
The userspace audit tools didn't like the existing formatting of the
AUDIT_ANOM_LINK event. It needed to be expanded to emit an AUDIT_PATH
event as well, so this implements the change. The bulk of the patch is
moving code out of auditsc.c into audit.c and audit.h for general use.
It expands audit_log_name to include an optional "struct path" argument
for the simple case of just needing to report a pathname. This also
makes
audit_log_task_info available when syscall auditing is not enabled,
since
it is needed in either case for process details.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Steve Grubb <sgrubb@redhat.com>
This function is called when we receive a netlink message from
userspace. We don't need to worry about it coming from irq context or
irqs making it re-entrant.
Signed-off-by: Eric Paris <eparis@redhat.com>
Most commands are entered one line at a time and processed as complete lines
in non-canonical mode. Commands that interactively require a password, enter
canonical mode to do this while shutting off echo. This pair of features
(icanon and !echo) can be used to avoid logging passwords by audit while still
logging the rest of the command.
Adding a member (log_passwd) to the struct audit_tty_status passed in by
pam_tty_audit allows control of canonical mode without echo per task.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Some of the callers of the audit tty function use spin_lock_irqsave/restore.
We were using the forced always enable version, which seems really bad.
Since I don't know every one of these code paths well enough, it makes
sense to just switch everything to the safe version. Maybe it's a
little overzealous, but it's a lot better than an unlucky deadlock when
we return to a caller with irq enabled and they expect it to be
disabled.
Signed-off-by: Eric Paris <eparis@redhat.com>
We have a number of places we were reimplementing the same code to write
out lsm labels. Just do it one darn place.
Signed-off-by: Eric Paris <eparis@redhat.com>
Since we are always current, we can push a lot of this stuff to the
bottom and get rid of useless interfaces and arguments.
Signed-off-by: Eric Paris <eparis@redhat.com>
We always use current. Stop pulling this when the skb comes in and
pushing it around as arguments. Just get it at the end when you need
it.
Signed-off-by: Eric Paris <eparis@redhat.com>
We used to have an inflexible mechanism to add audit rules to the
kernel. It hasn't been used in a long time. Get rid of that stuff.
Signed-off-by: Eric Paris <eparis@redhat.com>
We have 2 interfaces to send audit rules. Rather than check validity of
things in 2 places make a helper function.
Signed-off-by: Eric Paris <eparis@redhat.com>
Dave Hansen reported strange utime/stime values on his system:
https://lkml.org/lkml/2013/4/4/435
This happens because prev->stime value is bigger than rtime
value. Root of the problem are non-monotonic rtime values (i.e.
current rtime is smaller than previous rtime) and that should be
debugged and fixed.
But since problem did not manifest itself before commit
62188451f0 "cputime: Avoid
multiplication overflow on utime scaling", it should be threated
as regression, which we can easily fixed on cputime_adjust()
function.
For now, let's apply this fix, but further work is needed to fix
root of the problem.
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Due to rounding in scale_stime(), for big numbers, scaled stime
values will grow in chunks. Since rtime grow in jiffies and we
calculate utime like below:
prev->stime = max(prev->stime, stime);
prev->utime = max(prev->utime, rtime - prev->stime);
we could erroneously account stime values as utime. To prevent
that only update prev->{u,s}time values when they are smaller
than current rtime.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Here is patch, which adds Linus's cputime scaling algorithm to the
kernel.
This is a follow up (well, fix) to commit
d9a3c9823a ("sched: Lower chances
of cputime scaling overflow") which commit tried to avoid
multiplication overflow, but did not guarantee that the overflow
would not happen.
Linus crated a different algorithm, which completely avoids the
multiplication overflow by dropping precision when numbers are
big.
It was tested by me and it gives good relative error of
scaled numbers. Testing method is described here:
http://marc.info/?l=linux-kernel&m=136733059505406&w=2
Originally-From: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull extable dmesg fixlet from Ingo Molnar:
"Small tweak to reduce kmsg boot time spam"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
extable: Flip the sorting message
Pull core timer updates from Ingo Molnar:
"The main changes in this cycle's merge are:
- Implement shadow timekeeper to shorten in kernel reader side
blocking, by Thomas Gleixner.
- Posix timers enhancements by Pavel Emelyanov:
- allocate timer ID per process, so that exact timer ID allocations
can be re-created be checkpoint/restore code.
- debuggability and tooling (/proc/PID/timers, etc.) improvements.
- suspend/resume enhancements by Feng Tang: on certain new Intel Atom
processors (Penwell and Cloverview), there is a feature that the
TSC won't stop in S3 state, so the TSC value won't be reset to 0
after resume. This can be taken advantage of by the generic via
the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
recover/approximate sleep time, the main (and precise) clocksource
can be used.
- Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
CPUs the file goes beyond 4MB of size and thus the current
simplistic seqfile approach fails. Convert /proc/timer_list to a
proper seq_file with its own iterator.
- Cleanups and refactorings of the core timekeeping code by John
Stultz.
- International Atomic Clock time is managed by the NTP code
internally currently but not exposed externally. Separate the TAI
code out and add CLOCK_TAI support and TAI support to the hrtimer
and posix-timer code, by John Stultz.
- Add deep idle support enhacement to the broadcast clockevents core
timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
clockevents feature (which will be utilized by future clockevents
driver updates), which allows the use of IRQ affinities to avoid
spurious wakeups of idle CPUs - the right CPU with an expiring
timer will be woken.
- Add new ARM bcm281xx clocksource driver, by Christian Daudt
- ... various other fixes and cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
clockevents: Set dummy handler on CPU_DEAD shutdown
timekeeping: Update tk->cycle_last in resume
posix-timers: Remove unused variable
clockevents: Switch into oneshot mode even if broadcast registered late
timer_list: Convert timer list to be a proper seq_file
timer_list: Split timer_list_show_tickdevices
posix-timers: Show sigevent info in proc file
posix-timers: Introduce /proc/PID/timers file
posix timers: Allocate timer id per process (v2)
timekeeping: Make sure to notify hrtimers when TAI offset changes
hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
hrtimer: Add expiry time overflow check in hrtimer_interrupt
timekeeping: Shorten seq_count region
timekeeping: Implement a shadow timekeeper
timekeeping: Delay update of clock->cycle_last
timekeeping: Store cycle_last value in timekeeper struct as well
ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
timekeeping: Simplify tai updating from do_adjtimex
timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
...
Pull SMP/hotplug changes from Ingo Molnar:
"This is a pretty large, multi-arch series unifying and generalizing
the various disjunct pieces of idle routines that architectures have
historically copied from each other and have grown in random, wildly
inconsistent and sometimes buggy directions:
101 files changed, 455 insertions(+), 1328 deletions(-)
this went through a number of review and test iterations before it was
committed, it was tested on various architectures, was exposed to
linux-next for quite some time - nevertheless it might cause problems
on architectures that don't read the mailing lists and don't regularly
test linux-next.
This cat herding excercise was motivated by the -rt kernel, and was
brought to you by Thomas "the Whip" Gleixner."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
idle: Remove GENERIC_IDLE_LOOP config switch
um: Use generic idle loop
ia64: Make sure interrupts enabled when we "safe_halt()"
sparc: Use generic idle loop
idle: Remove unused ARCH_HAS_DEFAULT_IDLE
bfin: Fix typo in arch_cpu_idle()
xtensa: Use generic idle loop
x86: Use generic idle loop
unicore: Use generic idle loop
tile: Use generic idle loop
tile: Enter idle with preemption disabled
sh: Use generic idle loop
score: Use generic idle loop
s390: Use generic idle loop
powerpc: Use generic idle loop
parisc: Use generic idle loop
openrisc: Use generic idle loop
mn10300: Use generic idle loop
mips: Use generic idle loop
microblaze: Use generic idle loop
...
Pull scheduler changes from Ingo Molnar:
"The main changes in this development cycle were:
- full dynticks preparatory work by Frederic Weisbecker
- factor out the cpu time accounting code better, by Li Zefan
- multi-CPU load balancer cleanups and improvements by Joonsoo Kim
- various smaller fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
sched: Fix init NOHZ_IDLE flag
sched: Prevent to re-select dst-cpu in load_balance()
sched: Rename load_balance_tmpmask to load_balance_mask
sched: Move up affinity check to mitigate useless redoing overhead
sched: Don't consider other cpus in our group in case of NEWLY_IDLE
sched: Explicitly cpu_idle_type checking in rebalance_domains()
sched: Change position of resched_cpu() in load_balance()
sched: Fix wrong rq's runnable_avg update with rt tasks
sched: Document task_struct::personality field
sched/cpuacct/UML: Fix header file dependency bug on the UML build
cgroup: Kill subsys.active flag
sched/cpuacct: No need to check subsys active state
sched/cpuacct: Initialize cpuacct subsystem earlier
sched/cpuacct: Initialize root cpuacct earlier
sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
sched/cpuacct: Clean up cpuacct.h
sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
sched/cpuacct: Add cpuacct_acount_field()
sched/cpuacct: Add cpuacct_init()
...
Pull perf updates from Ingo Molnar:
"Features:
- Add "uretprobes" - an optimization to uprobes, like kretprobes are
an optimization to kprobes. "perf probe -x file sym%return" now
works like kretprobes. By Oleg Nesterov.
- Introduce per core aggregation in 'perf stat', from Stephane
Eranian.
- Add memory profiling via PEBS, from Stephane Eranian.
- Event group view for 'annotate' in --stdio, --tui and --gtk, from
Namhyung Kim.
- Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.
- Add Ivy Bridge-EP uncore support, by Zheng Yan
- IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.
- Add perf test entries for checking breakpoint overflow signal
handler issues, from Jiri Olsa.
- Add perf test entry for for checking number of EXIT events, from
Namhyung Kim.
- Add perf test entries for checking --cpu in record and stat, from
Jiri Olsa.
- Introduce perf stat --repeat forever, from Frederik Deweerdt.
- Add --no-demangle to report/top, from Namhyung Kim.
- PowerPC fixes plus a couple of cleanups/optimizations in uprobes
and trace_uprobes, by Oleg Nesterov.
Various fixes and refactorings:
- Fix dependency of the python binding wrt libtraceevent, from
Naohiro Aota.
- Simplify some perf_evlist methods and to allow 'stat' to share code
with 'record' and 'trace', by Arnaldo Carvalho de Melo.
- Remove dead code in related to libtraceevent integration, from
Namhyung Kim.
- Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
sched lat' back working, by Arnaldo Carvalho de Melo
- We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
de Melo.
- Kill a bunch of die() calls, from Namhyung Kim.
- Fix build on non-glibc systems due to libio.h absence, from Cody P
Schafer.
- Remove some perf_session and tracing dead code, from David Ahern.
- Honor parallel jobs, fix from Borislav Petkov
- Introduce tools/lib/lk library, initially just removing duplication
among tools/perf and tools/vm. from Borislav Petkov
... and many more I missed to list, see the shortlog and git log for
more details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
perf/x86/intel/P4: Robistify P4 PMU types
perf/x86/amd: Fix AMD NB and L2I "uncore" support
perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
perf/x86: Check all MSRs before passing hw check
perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
perf/x86/intel: Add Ivy Bridge-EP uncore support
perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
uprobes/tracing: Change create_trace_uprobe() to support uretprobes
uprobes/tracing: Make seq_printf() code uretprobe-friendly
uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
uprobes/tracing: Generalize struct uprobe_trace_entry_head
uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
uprobes/tracing: Kill the pointless seq_print_ip_sym() call
uprobes/tracing: Kill the pointless task_pt_regs() calls
...
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle are mostly related to preparatory work
for the full-dynticks work:
- Remove restrictions on no-CBs CPUs, make RCU_FAST_NO_HZ take
advantage of numbered callbacks, do callback accelerations based on
numbered callbacks. Posted to LKML at
https://lkml.org/lkml/2013/3/18/960
- RCU documentation updates. Posted to LKML at
https://lkml.org/lkml/2013/3/18/570
- Miscellaneous fixes. Posted to LKML at
https://lkml.org/lkml/2013/3/18/594"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
rcu: Make rcu_accelerate_cbs() note need for future grace periods
rcu: Abstract rcu_start_future_gp() from rcu_nocb_wait_gp()
rcu: Rename n_nocb_gp_requests to need_future_gp
rcu: Push lock release to rcu_start_gp()'s callers
rcu: Repurpose no-CBs event tracing to future-GP events
rcu: Rearrange locking in rcu_start_gp()
rcu: Make RCU_FAST_NO_HZ take advantage of numbered callbacks
rcu: Accelerate RCU callbacks at grace-period end
rcu: Export RCU_FAST_NO_HZ parameters to sysfs
rcu: Distinguish "rcuo" kthreads by RCU flavor
rcu: Add event tracing for no-CBs CPUs' grace periods
rcu: Add event tracing for no-CBs CPUs' callback registration
rcu: Introduce proper blocking to no-CBs kthreads GP waits
rcu: Provide compile-time control for no-CBs CPUs
rcu: Tone down debugging during boot-up and shutdown.
rcu: Add softirq-stall indications to stall-warning messages
rcu: Documentation update
rcu: Make bugginess of code sample more evident
rcu: Fix hlist_bl_set_first_rcu() annotation
rcu: Delete unused rcu_node "wakemask" field
...
During the 3.10 merge, a conflict happened and the resolution was
almost, but not quite, correct. An if statement was reversed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ Duh. That was just silly of me - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently remove_buf_file callback is called from from kobject
release method. This result in follow issue:
# blktrace -d /dev/sda1 -d /dev/sda -o test
blktrace_setup()
dir = create_dir()
rchan = relay_open(dir,...)
->create_buf_file_callback
buf_file = debugfs_create_file(dir, )
Userspace will open buf_file.
Later we make a decision to stop tracing
blktrace_down()
relay_close(rhcan) /* just decrement kobj reference */
/* since it is not zero then callback not called */
debugfs_remove(dir) /* FAIL due to non empty dir */
Later user space will close the file and file will be deleted,
but directory still exist.
user_space_close()
->file_release
->release_buf_file_callback
->debugfs_remove(buf_file
## TESTCASE:
# blktrace -d /dev/sda1 -d /dev/sda -o test
# After that blktrace infrastructure will remain broken in
# an unusable state so: blktrace -d /dev/sda1 will not work.
In fact this is general issue, blktrace is just one of examples.
We can not reliably remove parent dir until all users close the
buf_file.
Solution: We don't have to wait that long. File should be deleted inside
relay_close_buf().
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge second batch of fixes from Andrew Morton:
- various misc bits
- some printk updates
- a new "SRAM" driver.
- MAINTAINERS updates
- the backlight driver queue
- checkpatch updates
- a few init/ changes
- a huge number of drivers/rtc changes
- fatfs updates
- some lib/idr.c work
- some renaming of the random driver interfaces
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (285 commits)
net: rename random32 to prandom
net/core: remove duplicate statements by do-while loop
net/core: rename random32() to prandom_u32()
net/netfilter: rename random32() to prandom_u32()
net/sched: rename random32() to prandom_u32()
net/sunrpc: rename random32() to prandom_u32()
scsi: rename random32() to prandom_u32()
lguest: rename random32() to prandom_u32()
uwb: rename random32() to prandom_u32()
video/uvesafb: rename random32() to prandom_u32()
mmc: rename random32() to prandom_u32()
drbd: rename random32() to prandom_u32()
kernel/: rename random32() to prandom_u32()
mm/: rename random32() to prandom_u32()
lib/: rename random32() to prandom_u32()
x86: rename random32() to prandom_u32()
x86: pageattr-test: remove srandom32 call
uuid: use prandom_bytes()
raid6test: use prandom_bytes()
sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
...
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
Pull workqueue updates from Tejun Heo:
"A lot of activities on workqueue side this time. The changes achieve
the followings.
- WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
updated to be able to interface with multiple backend worker pools.
This involved a lot of churning but the end result seems actually
neater as unbound workqueues are now a lot closer to per-cpu ones.
- The ability to interface with multiple backend worker pools are
used to implement unbound workqueues with custom attributes.
Currently the supported attributes are the nice level and CPU
affinity. It may be expanded to include cgroup association in
future. The attributes can be specified either by calling
apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
the workqueue in question is exported through sysfs.
The backend worker pools are keyed by the actual attributes and
shared by any workqueues which share the same attributes. When
attributes of a workqueue are changed, the workqueue binds to the
worker pool with the specified attributes while leaving the work
items which are already executing in its previous worker pools
alone.
This allows converting custom worker pool implementations which
want worker attribute tuning to use workqueues. The writeback pool
is already converted in block tree and there are a couple others
are likely to follow including btrfs io workers.
- WQ_UNBOUND's ability to bind to multiple worker pools is also used
to make it NUMA-aware. Because there's no association between work
item issuer and the specific worker assigned to execute it, before
this change, using unbound workqueue led to unnecessary cross-node
bouncing and it couldn't be helped by autonuma as it requires tasks
to have implicit node affinity and workers are assigned randomly.
After these changes, an unbound workqueue now binds to multiple
NUMA-affine worker pools so that queued work items are executed in
the same node. This is turned on by default but can be disabled
system-wide or for individual workqueues.
Crypto was requesting NUMA affinity as encrypting data across
different nodes can contribute noticeable overhead and doing it
per-cpu was too limiting for certain cases and IO throughput could
be bottlenecked by one CPU being fully occupied while others have
idle cycles.
While the new features required a lot of changes including
restructuring locking, it didn't complicate the execution paths much.
The unbound workqueue handling is now closer to per-cpu ones and the
new features are implemented by simply associating a workqueue with
different sets of backend worker pools without changing queue,
execution or flush paths.
As such, even though the amount of change is very high, I feel
relatively safe in that it isn't likely to cause subtle issues with
basic correctness of work item execution and handling. If something
is wrong, it's likely to show up as being associated with worker pools
with the wrong attributes or OOPS while workqueue attributes are being
changed or during CPU hotplug.
While this creates more backend worker pools, it doesn't add too many
more workers unless, of course, there are many workqueues with unique
combinations of attributes. Assuming everything else is the same,
NUMA awareness costs an extra worker pool per NUMA node with online
CPUs.
There are also a couple things which are being routed outside the
workqueue tree.
- block tree pulled in workqueue for-3.10 so that writeback worker
pool can be converted to unbound workqueue with sysfs control
exposed. This simplifies the code, makes writeback workers
NUMA-aware and allows tuning nice level and CPU affinity via sysfs.
- The conversion to workqueue means that there's no 1:1 association
between a specific worker, which makes writeback folks unhappy as
they want to be able to tell which filesystem caused a problem from
backtrace on systems with many filesystems mounted. This is
resolved by allowing work items to set debug info string which is
printed when the task is dumped. As this change involves unifying
implementations of dump_stack() and friends in arch codes, it's
being routed through Andrew's -mm tree."
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
workqueue: use kmem_cache_free() instead of kfree()
workqueue: avoid false negative WARN_ON() in destroy_workqueue()
workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
workqueue: implement NUMA affinity for unbound workqueues
workqueue: introduce put_pwq_unlocked()
workqueue: introduce numa_pwq_tbl_install()
workqueue: use NUMA-aware allocation for pool_workqueues
workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
workqueue: map an unbound workqueues to multiple per-node pool_workqueues
workqueue: move hot fields of workqueue_struct to the end
workqueue: make workqueue->name[] fixed len
workqueue: add workqueue->unbound_attrs
workqueue: determine NUMA node of workers accourding to the allowed cpumask
workqueue: drop 'H' from kworker names of unbound worker pools
workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
workqueue: fix memory leak in apply_workqueue_attrs()
workqueue: fix unbound workqueue attrs hashing / comparison
workqueue: fix race condition in unbound workqueue free path
workqueue: remove pwq_lock which is no longer used
...
Pull async update from Tejun Heo:
"This contains three cleanup patches for async from Lai. All three
patches are essentially cosmetic."
* 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
async: rename and redefine async_func_ptr
async: remove unused @node from struct async_domain
async: simplify lowest_in_progress()
Use preferable function name which implies using a pseudo-random
number generator.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Error value got overwritten instantly.
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The early console implementations are the same all over the place. Move
the print function to kernel/printk and get rid of the copies.
[akpm@linux-foundation.org: arch/mips/kernel/early_printk.c needs kernel.h for va_list]
[paul.gortmaker@windriver.com: sh4: make the bios early console support depend on EARLY_PRINTK]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7ff9554bb5 ("printk: convert byte-buffer to variable-length
record buffer") removed start and end parameters from
call_console_drivers, but those parameters still exist in
include/trace/events/printk.h.
Without start and end parameters handling, printk tracing became more
simple as: trace_console(text, len);
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first batch of fixes from Andrew Morton:
- A couple of kthread changes
- A few minor audit patches
- A number of fbdev patches. Florian remains AWOL so I'm picking up
some of these.
- A few kbuild things
- ocfs2 updates
- Almost all of the MM queue
(And in the meantime, I already have the second big batch from Andrew
pending in my mailbox ;^)
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (149 commits)
memcg: take reference before releasing rcu_read_lock
mem hotunplug: fix kfree() of bootmem memory
mmKconfig: add an option to disable bounce
mm, nobootmem: do memset() after memblock_reserve()
mm, nobootmem: clean-up of free_low_memory_core_early()
fs/buffer.c: remove unnecessary init operation after allocating buffer_head.
numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU
mm: fix memory_hotplug.c printk format warning
mm: swap: mark swap pages writeback before queueing for direct IO
swap: redirty page if page write fails on swap file
mm, memcg: give exiting processes access to memory reserves
thp: fix huge zero page logic for page with pfn == 0
memcg: avoid accessing memcg after releasing reference
fs: fix fsync() error reporting
memblock: fix missing comment of memblock_insert_region()
mm: Remove unused parameter of pages_correctly_reserved()
firmware, memmap: fix firmware_map_entry leak
mm/vmstat: add note on safety of drain_zonestat
mm: thp: add split tail pages to shrink page list in page reclaim
mm: allow for outstanding swap writeback accounting
...
When hot removing memory presented at boot time, following messages are shown:
kernel BUG at mm/slub.c:3409!
invalid opcode: 0000 [#1] SMP
Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod
CPU 0
Pid: 5091, comm: kworker/0:2 Tainted: G W 3.9.0-rc6+ #15
RIP: kfree+0x232/0x240
Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80)
Call Trace:
__release_region+0xd4/0xe0
__remove_pages+0x52/0x110
arch_remove_memory+0x89/0xd0
remove_memory+0xc4/0x100
acpi_memory_device_remove+0x6d/0xb1
acpi_device_remove+0x89/0xab
__device_release_driver+0x7c/0xf0
device_release_driver+0x2f/0x50
acpi_bus_device_detach+0x6c/0x70
acpi_ns_walk_namespace+0x11a/0x250
acpi_walk_namespace+0xee/0x137
acpi_bus_trim+0x33/0x7a
acpi_bus_hot_remove_device+0xc4/0x1a1
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x1f7/0x590
worker_thread+0x11a/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
RIP [<ffffffff811c41d2>] kfree+0x232/0x240
RSP <ffff88084678d968>
The reason why the messages are shown is to release a resource
structure, allocated by bootmem, by kfree(). So when we release a
resource structure, we should check whether it is allocated by bootmem
or not.
But even if we know a resource structure is allocated by bootmem, we
cannot release it since SLxB cannot treat it. So for reusing a resource
structure, this patch remembers it by using bootmem_resource as follows:
When releasing a resource structure by free_resource(), free_resource()
checks whether the resource structure is allocated by bootmem or not.
If it is allocated by bootmem, free_resource() adds it to
bootmem_resource. If it is not allocated by bootmem, free_resource()
release it by kfree().
And when getting a new resource structure by get_resource(),
get_resource() checks whether bootmem_resource has released resource
structures or not. If there is a released resource structure,
get_resource() returns it. If there is not a releaed resource
structure, get_resource() returns new resource structure allocated by
kzalloc().
[akpm@linux-foundation.org: s/get_resource/alloc_resource/]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add release_mem_region_adjustable(), which releases a requested region
from a currently busy memory resource. This interface adjusts the
matched memory resource accordingly even if the requested region does
not match exactly but still fits into.
This new interface is intended for memory hot-delete. During bootup,
memory resources are inserted from the boot descriptor table, such as
EFI Memory Table and e820. Each memory resource entry usually covers
the whole contigous memory range. Memory hot-delete request, on the
other hand, may target to a particular range of memory resource, and its
size can be much smaller than the whole contiguous memory. Since the
existing release interfaces like __release_region() require a requested
region to be exactly matched to a resource entry, they do not allow a
partial resource to be released.
This new interface is restrictive (i.e. release under certain
conditions), which is consistent with other release interfaces,
__release_region() and __release_resource(). Additional release
conditions, such as an overlapping region to a resource entry, can be
supported after they are confirmed as valid cases.
There is no change to the existing interfaces since their restriction is
valid for I/O resources.
[akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
[akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
[akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __adjust_resource(), which is called by adjust_resource() internally
after the resource_lock is held. There is no interface change to
adjust_resource(). This change allows other functions to call
__adjust_resource() internally while the resource_lock is held.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
memory reserve to something other than 3%, which may be multiple
gigabytes on large memory systems. Only about 8MB is necessary to
enable recovery in the default mode, and only a few hundred MB are
required even when overcommit is disabled.
This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.
admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
Please see first patch in this series for full background, motivation,
testing, and full changelog.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_admin_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interfce (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on the whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutley safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcomitt 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the admin to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5432/5432 no yes yes
guess yes 4 5444/5444 1 yes yes
guess no 1 5302/5449 no yes yes
guess no 4 - crash no no
never yes 1 5460/5460 1 yes yes
never yes 4 5460/5460 1 yes yes
never no 1 5218/5432 no no yes
never no 4 5203/5448 no no yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5419/5419 no - yes 8MB yes
guess yes 4 5436/5436 1 - yes 8MB yes
guess no 1 5440/5440 * - yes 8MB yes
guess no 4 - crash - no 8MB no
* process would successfully mlock, then the oom killer would pick it
never yes 1 5446/5446 no 10MB yes 20MB yes
never yes 4 5456/5456 no 10MB yes 20MB yes
never no 1 5387/5429 no 128MB no 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5359/5448 no 10MB no 10MB barely
never no 1 5323/5428 no 0MB no 10MB barely
never no 1 5332/5428 no 0MB no 50MB yes
never no 1 5293/5429 no 0MB no 90MB yes
never no 1 5001/5427 no 230MB yes 338MB yes
never no 4* 4998/5424 no 230MB yes 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new interface, remove one ifdef. No code size changes.
We could/should have been using __meminit/__meminitdata here but there's
now no point in doing that because all this code is elided at compile time.
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get
the start address of vmalloc region (vmalloc_start). The address which
contains vmalloc_start value is represented as below:
vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)
However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list)
aren't exported as VMCOREINFO.
So this patch exports them externally with small cleanup.
[akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although our intention is to unexport internal structure entirely, but
there is one exception for kexec. kexec dumps address of vmlist and
makedumpfile uses this information.
We are about to remove vmlist, then another way to retrieve information
of vmalloc layer is needed for makedumpfile. For this purpose, we
export vmap_area_list, instead of vmlist.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drop_caches.c provides code only invokable via sysctl, so don't compile it
in when CONFIG_SYSCTL=n.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in the place.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
audit_trim_trees() calls get_tree(). If a failure occurs we must call
put_tree().
[akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement]
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In audit_data_to_entry() when a failure occurs we must check and free
the tree and watch to avoid a memory leak.
test:
plan:
test command:
"auditctl -a exit,always -w /etc -F auid=-1"
(on fedora17, need modify auditctl to let "-w /etc" has effect)
running:
under fedora17 x86_64, 2 CPUs 3.20GHz, 2.5GB RAM.
let 15 auditctl processes continue running at the same time.
monitor command:
watch -d -n 1 "cat /proc/meminfo | awk '{print \$2}' \
| head -n 4 | xargs \
| awk '{print \"used \",\$1 - \$2 - \$3 - \$4}'"
result:
for original version:
will use up all memory, within 3 hours.
kill all auditctl, the memory still does not free.
for new version (apply this patch):
after 14 hours later, not find issues.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The files which include kernel/audit.h are complied only when
CONFIG_AUDIT is set.
Just like audit_pid, there is no need to surround audit_ever_enabled
with CONFIG_AUDIT.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
audit_enabled has already been exported in include/linux/audit.h. and
kernel/audit.h includes include/linux/audit.h, no need to export
aduit_enabled again in kernel/audit.h
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only need to check if kauditd is valid after we start it, if kauditd
is invalid, we will set kauditd_task to NULL. So next time, we will
start kauditd again.
It means if kauditd_task is not NULL,it must be valid.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In audit_alloc_context() use kzalloc instead of kmalloc+memset. Also
rename audit_zero_context() to audit_set_context(), to represent it's
inner workings properly.
[akpm@linux-foundation.org: remove audit_set_context() altogether - fold it into its caller]
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_get_live_kthread() looks confusing and unneeded. It does
get_task_struct() but only kthread_stop() needs this, it can be called
even if the calller doesn't have a reference when we know that this
kthread can't exit until we do kthread_stop().
kthread_park() and kthread_unpark() do not need get_task_struct(), the
callers already have the reference. And it can not help if we can race
with the exiting kthread anyway, kthread_park() can hang forever in this
case.
Change kthread_park() and kthread_unpark() to use to_live_kthread(),
change kthread_stop() to do get_task_struct() by hand and remove
task_get_live_kthread().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"k->vfork_done != NULL" with a barrier() after to_kthread(k) in
task_get_live_kthread(k) looks unclear, and sub-optimal because we load
->vfork_done twice.
All we need is to ensure that we do not return to_kthread(NULL). Add a
new trivial helper which loads/checks ->vfork_done once, this also looks
more understandable.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Along with the usual minor fixes and clean ups there are a few major
changes with this pull request.
1) Multiple buffers for the ftrace facility
This feature has been requested by many people over the last few years.
I even heard that Google was about to implement it themselves. I finally
had time and cleaned up the code such that you can now create multiple
instances of the ftrace buffer and have different events go to different
buffers. This way, a low frequency event will not be lost in the noise
of a high frequency event.
Note, currently only events can go to different buffers, the tracers
(ie. function, function_graph and the latency tracers) still can only
be written to the main buffer.
2) The function tracer triggers have now been extended.
The function tracer had two triggers. One to enable tracing when a
function is hit, and one to disable tracing. Now you can record a
stack trace on a single (or many) function(s), take a snapshot of the
buffer (copy it to the snapshot buffer), and you can enable or disable
an event to be traced when a function is hit.
3) A perf clock has been added.
A "perf" clock can be chosen to be used when tracing. This will cause
ftrace to use the same clock as perf uses, and hopefully this will make
it easier to interleave the perf and ftrace data for analysis.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRfnTPAAoJEOdOSU1xswtMqYYH/1WIdrwXmxHflErnYkCIr3sU
QtYae2K5A1HcgiqOvRJrdWMOt016iMx5CaQQyBFM1vvMiPY0sTWRmwNxDfZzz9LN
10jRvWEzZSLtzl+a9mkFWLEpr5nR/QODOxkWFCnRWscp46sp04LSTxGDYsOnPQZB
sam/AQ1h4xA+DqDBChm9BDEUEPorGleTlN54LBaCGgSFGvrbF+eAg2s4vHNAQAvQ
8d5xjSE9zC7J+FqbVxvJTbKI3+EqKL6hMsJKsKfi0SI+FuxBaFMSltXck5zKyTI4
HpNJzXCmw+v90Tju7oMkPHh6RTbESPCHoGU+wqE52fM6m7oScVeuI/kfc6USwU4=
=W1n+
-----END PGP SIGNATURE-----
Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Along with the usual minor fixes and clean ups there are a few major
changes with this pull request.
1) Multiple buffers for the ftrace facility
This feature has been requested by many people over the last few
years. I even heard that Google was about to implement it themselves.
I finally had time and cleaned up the code such that you can now
create multiple instances of the ftrace buffer and have different
events go to different buffers. This way, a low frequency event will
not be lost in the noise of a high frequency event.
Note, currently only events can go to different buffers, the tracers
(ie function, function_graph and the latency tracers) still can only
be written to the main buffer.
2) The function tracer triggers have now been extended.
The function tracer had two triggers. One to enable tracing when a
function is hit, and one to disable tracing. Now you can record a
stack trace on a single (or many) function(s), take a snapshot of the
buffer (copy it to the snapshot buffer), and you can enable or disable
an event to be traced when a function is hit.
3) A perf clock has been added.
A "perf" clock can be chosen to be used when tracing. This will cause
ftrace to use the same clock as perf uses, and hopefully this will
make it easier to interleave the perf and ftrace data for analysis."
* tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits)
tracepoints: Prevent null probe from being added
tracing: Compare to 1 instead of zero for is_signed_type()
tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT
ftrace: Get rid of ftrace_profile_bits
tracing: Check return value of tracing_init_dentry()
tracing: Get rid of unneeded key calculation in ftrace_hash_move()
tracing: Reset ftrace_graph_filter_enabled if count is zero
tracing: Fix off-by-one on allocating stat->pages
kernel: tracing: Use strlcpy instead of strncpy
tracing: Update debugfs README file
tracing: Fix ftrace_dump()
tracing: Rename trace_event_mutex to trace_event_sem
tracing: Fix comment about prefix in arch_syscall_match_sym_name()
tracing: Convert trace_destroy_fields() to static
tracing: Move find_event_field() into trace_events.c
tracing: Use TRACE_MAX_PRINT instead of constant
tracing: Use pr_warn_once instead of open coded implementation
ring-buffer: Add ring buffer startup selftest
tracing: Bring Documentation/trace/ftrace.txt up to date
tracing: Add "perf" trace_clock
...
Conflicts:
kernel/trace/ftrace.c
kernel/trace/trace.c
Pull locking changes from Ingo Molnar:
"The most noticeable change are mutex speedups from Waiman Long, for
higher loads. These scalability changes should be most noticeable on
larger server systems.
There are also cleanups, fixes and debuggability improvements."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Consolidate bug messages into a single print_lockdep_off() function
lockdep: Print out additional debugging advice when we hit lockdep BUGs
mutex: Back out architecture specific check for negative mutex count
mutex: Queue mutex spinners with MCS lock to reduce cacheline contention
mutex: Make more scalable by doing less atomic operations
mutex: Move mutex spinning code from sched/core.c back to mutex.c
locking/rtmutex/tester: Set correct permissions on sysfs files
lockdep: Remove unnecessary 'hlock_next' variable
Reported by Fengguang's kbuild test robot:
kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used
Introduced by commit e0e80a02e5
("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
which removed generate_sched_domains() from cpuset_hotplug_workfn().
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* pm-assorted:
PM / OPP: add documentation to RCU head in struct opp
PM / sleep: invalidate TEST_CPUS and TEST_CORE support for freeze state
PM / sleep: add TEST_PLATFORM support for freeze state
In cpuset_hotplug_workfn(), partition_sched_domains() is called without
hotplug lock held, which is actually needed (stated in the function
header of partition_sched_domains()).
This patch tries to use rebuild_sched_domains() to solve the above
issue, and makes the code looks a little simpler.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull more full-dynticks updates from Frederic Weisbecker:
* Get rid of the passive dependency on VIRT_CPU_ACCOUNTING_GEN (finally!)
* Preparation patch to remove the dependency on CONFIG_64BITS
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I mistakenly removed the call to eventfd->poll() while I was actually
intending to remove the return value...
Calling evenfd->poll() will hook cgroup_event_wake() to the poll
waitqueue, which will be called to unregister eventfd when rmdir a
cgroup or close eventfd.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Try:
# mount -t cgroup xxx /cgroup
# mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup
And you might see this:
ida_remove called for id=1 which is not allocated.
It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
and free cgrp->root before ida_simple_removed() is called. What's
worse is we're accessing cgrp->root while it has been freed.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Turn the full dynticks passive dependency on VIRT_CPU_ACCOUNTING_GEN
to an active one.
The full dynticks Kconfig is currently hidden behind the full dynticks
cputime accounting, which is an awkward and counter-intuitive layout:
the user first has to select the dynticks cputime accounting in order
to make the full dynticks feature to be visible.
We definetly want it the other way around. The usual way to perform
this kind of active dependency is use "select" on the depended target.
Now we can't use the Kconfig "select" instruction when the target is
a "choice".
So this patch inspires on how the RCU subsystem Kconfig interact
with its dependencies on SMP and PREEMPT: we make sure that cputime
accounting can't propose another option than VIRT_CPU_ACCOUNTING_GEN
when NO_HZ_FULL is selected by using the right "depends on" instruction
for each cputime accounting choices.
v2: Keep full dynticks cputime accounting available even without
full dynticks, as per Paul McKenney's suggestion.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
On my SMP platform which is made of 5 cores in 2 clusters, I
have the nr_busy_cpu field of sched_group_power struct that is
not null when the platform is fully idle - which makes the
scheduler unhappy.
The root cause is:
During the boot sequence, some CPUs reach the idle loop and set
their NOHZ_IDLE flag while waiting for others CPUs to boot. But
the nr_busy_cpus field is initialized later with the assumption
that all CPUs are in the busy state whereas some CPUs have
already set their NOHZ_IDLE flag.
More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created in order to ensure that NOHZ_IDLE and
nr_busy_cpus are aligned.
This condition can be ensured by adding a synchronize_rcu()
between the destruction of old sched_domains and the creation of
new ones so the NOHZ_IDLE flag will not be updated with old
sched_domain once it has been initialized. But this solution
introduces a additionnal latency in the rebuild sequence that is
called during cpu hotplug.
As suggested by Frederic Weisbecker, another solution is to have
the same rcu lifecycle for both NOHZ_IDLE and sched_domain
struct. A new nohz_idle field is added to sched_domain so both
status and sched_domain will share the same RCU lifecycle and
will be always synchronized. In addition, there is no more need
to protect nohz_idle against concurrent access as it is only
modified by 2 exclusive functions called by local cpu.
This solution has been prefered to the creation of a new struct
with an extra pointer indirection for sched_domain.
The synchronization is done at the cost of :
- An additional indirection and a rcu_dereference for accessing nohz_idle.
- We use only the nohz_idle field of the top sched_domain.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
Cc: pjt@google.com
Cc: rostedt@goodmis.org
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
[ Fixed !NO_HZ build bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
One testbox of mine (Intel Nehalem, 16-way) uses MWAIT for its idle routine,
which apparently can break out of its idle loop rather frequently, with
high frequency.
In that case NO_HZ_FULL=y kernels show high ksoftirqd overhead and constant
context switching, because tick_nohz_stop_sched_tick() will, if
delta_jiffies == 0, mis-identify this as a timer event - activating the
TIMER_SOFTIRQ, which wakes up ksoftirqd.
Fix this by treating delta_jiffies == 0 the same way we treat other short
wakeups, delta_jiffies == 1.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Also add some missing printk levels.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130425174002.GA26769@redhat.com
[ Tweaked the messages a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We occasionally get reports of these BUGs being hit, and the
stack trace doesn't necessarily always tell us what we need to
know about why we are hitting those limits.
If users start attaching /proc/lock_stats to reports we may have
more of a clue what's going on.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20130423163403.GA12839@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vitaliy reported that a per cpu HPET timer interrupt crashes the
system during hibernation. What happens is that the per cpu HPET timer
gets shut down when the nonboot cpus are stopped. When the nonboot
cpus are onlined again the HPET code sets up the MSI interrupt which
fires before the clock event device is registered. The event handler
is still set to hrtimer_interrupt, which then crashes the machine due
to highres mode not being active.
See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333
There is no real good way to avoid that in the HPET code. The HPET
code alrady has a mechanism to detect spurious interrupts when event
handler == NULL for a similar reason.
We can handle that in the clockevent/tick layer and replace the
previous functional handler with a dummy handler like we do in
tick_setup_new_device().
The original clockevents code did this in clockevents_exchange_device(),
but that got removed by commit 7c1e76897 (clockevents: prevent
clockevent event_handler ending up handler_noop) which forgot to fix
it up in tick_shutdown(). Same issue with the broadcast device.
Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Cc: 700333@bugs.debian.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove the dependency on (TREE_RCU || TREE_PREEMPT_RCU). The full
dynticks option already depends on SMP which implies
(whatever flavour of) RCU tree config anyway.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Commit 88b8dac0 makes load_balance() consider other cpus in its
group. But, in that, there is no code for preventing to
re-select dst-cpu. So, same dst-cpu can be selected over and
over.
This patch add functionality to load_balance() in order to
exclude cpu which is selected once. We prevent to re-select
dst_cpu via env's cpus, so now, env's cpus is a candidate not
only for src_cpus, but also dst_cpus.
With this patch, we can remove lb_iterations and
max_lb_iterations, because we decide whether we can go ahead or
not via env's cpus.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This name doesn't represent specific meaning.
So rename it to imply it's purpose.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, LBF_ALL_PINNED is cleared after affinity check is
passed. So, if task migration is skipped by small load value or
small imbalance value in move_tasks(), we don't clear
LBF_ALL_PINNED. At last, we trigger 'redo' in load_balance().
Imbalance value is often so small that any tasks cannot be moved
to other cpus and, of course, this situation may be continued
after we change the target cpu. So this patch move up affinity
check code and clear LBF_ALL_PINNED before evaluating load value
in order to mitigate useless redoing overhead.
In addition, re-order some comments correctly.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jason Low <jason.low2@hp.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 88b8dac0 makes load_balance() consider other cpus in its
group, regardless of idle type. When we do NEWLY_IDLE balancing,
we should not consider it, because a motivation of NEWLY_IDLE
balancing is to turn this cpu to non idle state if needed. This
is not the case of other cpus. So, change code not to consider
other cpus for NEWLY_IDLE balancing.
With this patch, assign 'if (pulled_task) this_rq->idle_stamp =
0' in idle_balance() is corrected, because NEWLY_IDLE balancing
doesn't consider other cpus. Assigning to 'this_rq->idle_stamp'
is now valid.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After commit 88b8dac0, dst-cpu can be changed in load_balance(),
then we can't know cpu_idle_type of dst-cpu when load_balance()
return positive. So, add explicit cpu_idle_type checking.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cur_ld_moved is reset if env.flags hit LBF_NEED_BREAK.
So, there is possibility that we miss doing resched_cpu().
Correct it as changing position of resched_cpu()
before checking LBF_NEED_BREAK.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Jason Low <jason.low2@hp.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366705662-3587-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c
The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.
The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.
An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.
Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.
Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.
Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not obvious to find out why the full dynticks subsystem
doesn't always stop the tick: whether this is due to kthreads,
posix timers, perf events, etc...
These new tracepoints are here to help the user diagnose
the failures and test this feature.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
It makes testing and implementation much easier as we
know in advance that all CPUs are RCU nocbs.
Also this prepares to remove the dynamic check for
nohz_full= boot mask to be a subset of rcu_nocbs=
Eventually this should also help removing the requirement
for the boot CPU to be outside the full dynticks range.
Suggested-by: Christoph Lameter <cl@linux.com>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Eventually try to disable tick on irq exit, now that the
fundamental infrastructure is in place.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
When a task is scheduled in, it may have some properties
of its own that could make the CPU reconsider the need for
the tick: posix cpu timers, perf events, ...
So notify the full dynticks subsystem when a task gets
scheduled in and re-check the tick dependency at this
stage. This is done through a self IPI to avoid messing
up with any current lock scenario.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Interrupt exit is a natural place to stop the tick: it happens
after all events happening before and during the irq which
are liable to update the dependency on the tick occured. Also
it makes sure that any check on tick dependency is well ordered
against dynticks kick IPIs.
Bring in the infrastructure that performs the tick dependency
checks on irq exit and shut it down if these checks show that we
can do it safely.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Implement the full dynticks kick that is performed from
IPIs sent by various subsystems (scheduler, posix timers, ...)
when they want to notify about a new event that may
reconsider the dependency on the tick.
Most of the time, such an event end up restarting the tick.
(Part of the design with subsystems providing *_can_stop_tick()
helpers suggested by Peter Zijlstra a while ago).
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
commit 7ec98e15aa (timekeeping: Delay update of clock->cycle_last)
forgot to update tk->cycle_last in the resume path. This results in a
stale value versus clock->cycle_last and prevents resume in the worst
case.
Reported-by: Jiri Slaby <jslaby@suse.cz>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Linux-pm mailing list <linux-pm@lists.linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304211648150.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The scheduler IPI is used by the scheduler to kick
full dynticks CPUs asynchronously when more than one
task are running or when a new timer list timer is
enqueued. This way the destination CPU can decide
to restart the tick to handle this new situation.
Now let's call that kick in the scheduler IPI.
(Reusing the scheduler IPI rather than implementing
a new IPI was suggested by Peter Zijlstra a while ago)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Provide a new helper to be called from the full dynticks engine
before stopping the tick in order to make sure we don't stop
it when there is more than one task running on the CPU.
This way we make sure that the tick stays alive to maintain
fairness.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Kick the tick on full dynticks CPUs when they get more
than one task running on their queue. This makes sure that
local fairness is maintained by the tick on the destination.
This is done regardless of these tasks' class. We should
be able to be more clever in the future depending on these. eg:
a CPU that runs a SCHED_FIFO task doesn't need to maintain
fairness against local pending tasks of the fair class.
But keep things simple for now.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Provide a new helper that help full dynticks CPUs to prevent
from stopping their tick in case there are events in the local
rotation list.
This way we make sure that perf_event_task_tick() is serviced
on demand.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Kick the current CPU's tick by sending it a self IPI when
an event is queued on the rotation list and it is the first
element inserted. This makes sure that perf_event_task_tick()
works on full dynticks CPUs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
The test that checks if a CPU can stop its tick from posix CPU
timers angle was mistakenly inverted.
What we want is to prevent the tick from being stopped as long
as the current CPU's task runs a posix CPU timer.
Fix this.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The current update of the rq's load can be erroneous when RT
tasks are involved.
The update of the load of a rq that becomes idle, is done only
if the avg_idle is less than sysctl_sched_migration_cost. If RT
tasks and short idle duration alternate, the runnable_avg will
not be updated correctly and the time will be accounted as idle
time when a CFS task wakes up.
A new idle_enter function is called when the next task is the
idle function so the elapsed time will be accounted as run time
in the load of the rq, whatever the average idle time is. The
function update_rq_runnable_avg is removed from idle_balance.
When a RT task is scheduled on an idle CPU, the update of the
rq's load is not done when the rq exit idle state because CFS's
functions are not called. Then, the idle_balance, which is
called just before entering the idle function, updates the rq's
load and makes the assumption that the elapsed time since the
last update, was only running time.
As a consequence, the rq's load of a CPU that only runs a
periodic RT task, is close to LOAD_AVG_MAX whatever the running
duration of the RT task is.
A new idle_exit function is called when the prev task is the
idle function so the elapsed time will be accounted as idle time
in the rq's load.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: pjt@google.com
Cc: fweisbec@gmail.com
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
arch/x86/kernel/cpu/perf_event_intel.c
Merge in the latest fixes before applying new patches, resolve the conflict.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull kdump fixes from Peter Anvin:
"The kexec/kdump people have found several problems with the support
for loading over 4 GiB that was introduced in this merge cycle. This
is partly due to a number of design problems inherent in the way the
various pieces of kdump fit together (it is pretty horrifically manual
in many places.)
After a *lot* of iterations this is the patchset that was agreed upon,
but of course it is now very late in the cycle. However, because it
changes both the syntax and semantics of the crashkernel option, it
would be desirable to avoid a stable release with the broken
interfaces."
I'm not happy with the timing, since originally the plan was to release
the final 3.9 tomorrow. But apparently I'm doing an -rc8 instead...
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kexec: use Crash kernel for Crash kernel low
x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
x86, kdump: Retore crashkernel= to allocate under 896M
x86, kdump: Set crashkernel_low automatically
Somehow tracepoint_entry_add_probe() function allows a null probe function.
And, this may lead to unexpected results since the number of probe
functions in an entry can be counted by checking whether a probe is null
or not in the for-loop.
This patch prevents a null probe from being added.
In tracepoint_entry_remove_probe() function, checking probe parameter
within the for-loop is moved out for code efficiency, leaving the null probe
feature which removes all probe functions in the entry.
Link: http://lkml.kernel.org/r/1365991995-19445-1-git-send-email-kpark3469@gmail.com
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Bring a new helper that the full dynticks infrastructure can
call in order to know if it can safely stop the tick from
the posix cpu timers subsystem point of view.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Kick the full dynticks CPUs when a posix cpu timer is enqueued by
way of a standard call to posix_cpu_timer_set() or set_process_cpu_timer().
This also include rescheduled firing timers.
This way they can re-evaluate the state of (and possibly restart) their
tick against the new expiry modification.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Provide a new kernel config that defaults all CPUs to be part
of the full dynticks range, except the boot one for timekeeping.
This default setting is overriden by the nohz_full= boot option
if passed by the user.
This is helpful for those who don't need a finegrained range
of full dynticks CPU and also for automated testing.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
We need full dynticks CPU to also be RCU nocb so
that we don't have to keep the tick to handle RCU
callbacks.
Make sure the range passed to nohz_full= boot
parameter is a subset of rcu_nocbs=
The CPUs that fail to meet this requirement will be
excluded from the nohz_full range. This is checked
early in boot time, before any CPU has the opportunity
to stop its tick.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The timekeeping job must be able to run early on boot
because there may be some pre-SMP (and thus pre-initcalls )
components that rely on it. The IO-APIC is one such users
as it tests the timer health by watching jiffies progression.
Given that it happens before we know the initial online
set, we can't rely on it to select a timekeeper. We need
one before SMP time otherwise we simply crash on boot.
To fix this and keep things simple for now, force the boot CPU
outside of the full dynticks range in any case and do this early
on kernel parameter parsing time.
We might want a trickier solution later, expecially for aSMP
architectures that need to assign housekeeping tasks to arbitrary
low power CPUs.
But it's still first pass KISS time for now.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Linus suggested that probably all the supported architectures can
allow a negative mutex count without incorrect behavior, so we can
then back out the architecture specific change and allow the
mutex count to go to any negative number. That should further
reduce contention for non-x86 architecture.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option
turned on) allow multiple tasks to spin on a single mutex
concurrently. A potential problem with the current approach is
that when the mutex becomes available, all the spinning tasks
will try to acquire the mutex more or less simultaneously. As a
result, there will be a lot of cacheline bouncing especially on
systems with a large number of CPUs.
This patch tries to reduce this kind of contention by putting
the mutex spinners into a queue so that only the first one in
the queue will try to acquire the mutex. This will reduce
contention and allow all the tasks to move forward faster.
The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation.
This patch will add a new field into the mutex data structure
for holding the MCS lock. This expands the mutex size by 8 bytes
for 64-bit system and 4 bytes for 32-bit system. This overhead
will be avoid if the MUTEX_SPIN_ON_OWNER option is turned off.
The following table shows the jobs per minute (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the fserver
workloads to 1/2/4/8 nodes with hyperthreading off.
+-----------------+-----------+-----------+-------------+----------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | % Change |
| | w/o patch | patch 1 | patches 1&2 | 1->1&2 |
+-----------------+------------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972 | 227237 | 305043 | +34.2% |
| 4 nodes, HT off | 393503 | 381558 | 394650 | +3.4% |
| 2 nodes, HT off | 334957 | 325240 | 338853 | +4.2% |
| 1 node , HT off | 198141 | 197972 | 198075 | +0.1% |
+-----------------+------------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325 | 312870 | 332185 | +6.2% |
| 4 nodes, HT off | 390698 | 378279 | 393419 | +4.0% |
| 2 nodes, HT off | 336986 | 326543 | 340260 | +4.2% |
| 1 node , HT off | 197588 | 197622 | 197582 | 0.0% |
+-----------------+-----------+-----------+-------------+----------+
At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.
The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable
change in performance. Rather, there is a slight drop in
performance. This mutex spinning patch more than recovers the
lost performance and show a significant increase of +30% at high
user load with the full 8 nodes. Similar improvements were also
seen in a 3.8 kernel.
The table below shows the %time spent by different kernel
functions as reported by perf when running the fserver workload
at 1500 users with all 8 nodes.
+-----------------------+-----------+---------+-------------+
| Function | % time | % time | % time |
| | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed | 34.96% | 34.91% | 29.14% |
| __write_lock_failed | 10.14% | 10.68% | 7.51% |
| mutex_spin_on_owner | 3.62% | 3.42% | 2.33% |
| mspin_lock | N/A | N/A | 9.90% |
| __mutex_lock_slowpath | 1.46% | 0.81% | 0.14% |
| _raw_spin_lock | 2.25% | 2.50% | 1.10% |
+-----------------------+-----------+---------+-------------+
The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as
shown by the __mutex_lock_slowpath figure) which help a little
bit. We saw only a few percents improvement with that.
By applying patch 2 as well, the single mutex_spin_on_owner
figure is now split out into an additional mspin_lock figure.
The time increases from 3.42% to 11.23%. It shows a great
reduction in contention among the spinners leading to a 30%
improvement. The time ratio 9.9/2.33=4.3 indicates that there
are on average 4+ spinners waiting in the spin_lock loop for
each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.
The table below shows the performance change of both patches 1 &
2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
hyperthreading off).
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | 0.0% | -0.8% | +0.6% |
| five_sec | -0.3% | +0.8% | +0.8% |
| high_systime | +0.4% | +2.4% | +2.1% |
| new_fserver | +0.1% | +14.1% | +34.2% |
| shared | -0.5% | -0.3% | -0.4% |
| short | -1.7% | -9.8% | -8.3% |
+--------------+---------------+----------------+-----------------+
The short workload is the only one that shows a decline in
performance probably due to the spinner locking and queuing
overhead.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a
total of three atomic read-modify-write instructions will be
issued in rapid succession. This can cause a lot of cache
bouncing when many tasks are trying to acquire the mutex at the
same time.
This patch will reduce the number of atomic_xchg instructions
used by checking the counter value first before issuing the
instruction. The atomic_read() function is just a simple memory
read. The atomic_xchg() function, on the other hand, can be up
to 2 order of magnitude or even more in cost when compared with
atomic_read(). By using atomic_read() to check the value first
before calling atomic_xchg(), we can avoid a lot of unnecessary
cache coherency traffic. The only downside with this change is
that a task on the slow path will have a tiny bit less chance of
getting the mutex when competing with another task in the fast
path.
The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed
before calling atomic_cmpxchg().
The mutex locking and unlocking code for the x86 architecture
can allow any negative number to be used in the mutex count to
indicate that some tasks are waiting for the mutex. I am not so
sure if that is the case for the other architectures. So the
default is to avoid atomic_xchg() if the count has already been
set to -1. For x86, the check is modified to include all
negative numbers to cover a larger case.
The following table shows the jobs per minutes (JPM) scalability
data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
numactl command is used to restrict the running of the
high_systime workloads to 1/2/4/8 nodes with hyperthreading on
and off.
+-----------------+-----------+------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | w/o patch | with patch | |
+-----------------+-----------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 36980 | 148590 | +301.8% |
| 8 nodes, HT off | 42799 | 145011 | +238.8% |
| 4 nodes, HT on | 61318 | 118445 | +51.1% |
| 4 nodes, HT off | 158481 | 158592 | +0.1% |
| 2 nodes, HT on | 180602 | 173967 | -3.7% |
| 2 nodes, HT off | 198409 | 198073 | -0.2% |
| 1 node , HT on | 149042 | 147671 | -0.9% |
| 1 node , HT off | 126036 | 126533 | +0.4% |
+-----------------+-----------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 41525 | 122349 | +194.6% |
| 8 nodes, HT off | 49866 | 124032 | +148.7% |
| 4 nodes, HT on | 66409 | 106984 | +61.1% |
| 4 nodes, HT off | 119880 | 130508 | +8.9% |
| 2 nodes, HT on | 138003 | 133948 | -2.9% |
| 2 nodes, HT off | 132792 | 131997 | -0.6% |
| 1 node , HT on | 116593 | 115859 | -0.6% |
| 1 node , HT off | 104499 | 104597 | +0.1% |
+-----------------+------------+-----------+----------+
At low user range 10-100, the JPM differences were within +/-1%.
So they are not that interesting.
AIM7 benchmark run has a pretty large run-to-run variance due to
random nature of the subtests executed. So a difference of less
than +-5% may not be really significant.
This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant
drop-off at high node count. The patch has practically no
impact on 1 and 2 nodes system.
The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function
by the high_systime workload at 1500 users for 2/4/8-node
configurations with hyperthreading off.
+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes | 65.34% | 0.69% | -99% |
| 4 nodes | 8.70% | 1.02% | -88% |
| 2 nodes | 0.41% | 0.32% | -22% |
+---------------+-----------------+------------------+---------+
It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.
The table below show the improvements in other AIM7 workloads
(at 8 nodes, hyperthreading off).
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | +0.6% | +104.2% | +185.9% |
| five_sec | +1.9% | +0.9% | +0.9% |
| fserver | +1.4% | -7.7% | +5.1% |
| new_fserver | -0.5% | +3.2% | +3.1% |
| shared | +13.1% | +146.1% | +181.5% |
| short | +7.4% | +5.0% | +4.2% |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Norton: Scott J <scott.norton@hp.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
feature bit was really just an early hack to make with/without
mutex-spinning testable. So it is no longer necessary.
This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
move the mutex spinning code from kernel/sched/core.c back to
kernel/mutex.c which is where they should belong.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should store file xattrs in struct cfent instead of struct cftype,
because cftype is a type while cfent is object instance of cftype.
For example each cgroup has a tasks file, and each tasks file is
associated with a uniq cfent, but all those files share the same
struct cftype.
Alexey Kodanev reported a crash, which can be reproduced:
# mount -t cgroup -o xattr /sys/fs/cgroup
# mkdir /sys/fs/cgroup/test
# setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
# rmdir /sys/fs/cgroup/test
# umount /sys/fs/cgroup
oops!
In this case, simple_xattrs_free() will free the same struct simple_xattrs
twice.
tj: Dropped unused local variable @cft from cgroup_diput().
Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull user-namespace fixes from Andy Lutomirski.
* 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
userns: Changing any namespace id mappings should require privileges
userns: Check uid_map's opener's fsuid, not the current fsuid
userns: Don't let unprivileged users trick privileged users into setting the id_map
Provide two new helpers in order to notify the full dynticks CPUs about
some internal system changes against which they may reconsider the state
of their tick. Some practical examples include: posix cpu timers, perf tick
and sched clock tick.
For now the notifying handler, implemented through IPIs, is a stub
that will be implemented when we get the tick stop/restart infrastructure
in.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
This reverts commit 3a366e614d.
Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.
Jens says:
"It's not quite clear why that is yet, so I think we should just revert
the commit for 3.9 final (which I'm assuming is pretty close).
The wifi is crap at the LSF hotel, so sending this email instead of
queueing up a revert and pull request."
Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a double locking bug caused when debug.kprobe-optimization=0.
While the proc_kprobes_optimization_handler locks kprobe_mutex,
wait_for_kprobe_optimizer locks it again and that causes a double lock.
To fix the bug, this introduces different mutex for protecting
sysctl parameter and locks it in proc_kprobes_optimization_handler.
Of course, since we need to lock kprobe_mutex when touching kprobes
resources, that is done in *optimize_all_kprobes().
This bug was introduced by commit ad72b3bea7 ("kprobes: fix
wait_for_kprobe_optimizer()")
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the unused variable *node introduced by commit 5ed67f05 (posix
timers: Allocate timer id per process)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
This fixes a kernel memory contents leak via the tkill and tgkill syscalls
for compat processes.
This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field
when handling signals delivered from tkill.
The place of the infoleak:
int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
{
...
put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
...
}
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Reviewed-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can extend kexec-tools to support multiple "Crash kernel" in /proc/iomem
instead.
So we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem.
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of
crashkernel_hign=X crashkernel_low=Y. As that could be extensible.
-v2: according to Vivek, change delimiter to ;
-v3: let hign and low only handle simple form and it conforms to
description in kernel-parameters.txt
still keep crashkernel=X override any crashkernel=X,high
crashkernel=Y,low
-v4: update get_last_crashkernel returning and add more strict
checking in parse_crashkernel_simple() found by HATAYAMA.
-v5: Change delimiter back to , according to HPA.
also separate parse_suffix from parse_simper according to vivek.
so we can avoid @pos in that path.
-v6: Tight the checking about crashkernel=X,highblahblah,high
found by HTYAYAMA.
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Vivek found old kexec-tools does not work new kernel anymore.
So change back crashkernel= back to old behavoir, and add crashkernel_high=
to let user decide if buffer could be above 4G, and also new kexec-tools will
be needed.
-v2: let crashkernel=X override crashkernel_high=
update description about _high will be ignored by crashkernel=X
-v3: update description about kernel-parameters.txt according to Vivek.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
tick_oneshot_notify() is used to notify a particular CPU to try
to switch into oneshot mode after a oneshot capable tick device
is registered and tick_clock_notify() is used to notify all CPUs
to try to switch into oneshot mode after a high res clocksource
is registered. There is one caveat; if the tick devices suffer
from FEAT_C3_STOP we don't try to switch into oneshot mode unless
we have a oneshot capable broadcast device already registered.
If the broadcast device is registered after the tick devices that
have FEAT_C3_STOP we'll never try to switch into oneshot mode
again, causing us to be stuck in periodic mode forever. Avoid
this scenario by calling tick_clock_notify() after we register
the broadcast device so that we try to switch into oneshot mode
on all CPUs one more time.
[ tglx: Adopted to timers/core and added a comment ]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1366219566-29783-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When running with 4096 cores attemping to read /proc/timer_list will fail
with an ENOMEM condition. On a sufficantly large systems the total amount
of data is more then 4mb, so it won't fit into a single buffer. The
failure can also occur on smaller systems when memory fragmentation is
high as reported by Dave Jones.
Convert /proc/timer_list to a proper seq_file with its own iterator. This
is a little more complex given that we have to make two passes with two
separate headers.
sysrq_timer_list_show also needed to be updated to reflect the fact that
now timer_list_show only does one cpu at at time.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-3-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Split timer_list_show_tickdevices() into the header printout and pull
the rest up to timer_list_show. This is a preparatory patch for
converting timer_list to a proper seqfile with its own iterator
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1364345790-14577-2-git-send-email-nzimmer@sgi.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently kernel generates IDs for posix timers in a global manner --
there's a kernel-wide IDR tree from which IDs are created. This makes
it impossible to recreate a timer with a desired ID (in particular
this is done by the CRIU checkpoint-restore project) -- since these
IDs are global it may happen, that at the time we recreate a timer, the
ID we want for it is already busy by some other timer.
In order to address this, replace the IDR tree with a global hash
table for timers and makes timer IDs unique per signal_struct (to
which timers are linked anyway). With this, two timers belonging to
different processes may have equal IDs and we can recreate either of
them with the ID we want.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Link: http://lkml.kernel.org/r/513D9FF5.9010004@parallels.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Otherwise we get a race between unload and reload of the same module:
the new module doesn't see the old one in the list, but then fails because
it can't register over the still-extant entries in sysfs:
[ 103.981925] ------------[ cut here ]------------
[ 103.986902] WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xab/0xd0()
[ 103.993606] Hardware name: CrownBay Platform
[ 103.998075] sysfs: cannot create duplicate filename '/module/pch_gbe'
[ 104.004784] Modules linked in: pch_gbe(+) [last unloaded: pch_gbe]
[ 104.011362] Pid: 3021, comm: modprobe Tainted: G W 3.9.0-rc5+ #5
[ 104.018662] Call Trace:
[ 104.021286] [<c103599d>] warn_slowpath_common+0x6d/0xa0
[ 104.026933] [<c1168c8b>] ? sysfs_add_one+0xab/0xd0
[ 104.031986] [<c1168c8b>] ? sysfs_add_one+0xab/0xd0
[ 104.037000] [<c1035a4e>] warn_slowpath_fmt+0x2e/0x30
[ 104.042188] [<c1168c8b>] sysfs_add_one+0xab/0xd0
[ 104.046982] [<c1168dbe>] create_dir+0x5e/0xa0
[ 104.051633] [<c1168e78>] sysfs_create_dir+0x78/0xd0
[ 104.056774] [<c1262bc3>] kobject_add_internal+0x83/0x1f0
[ 104.062351] [<c126daf6>] ? kvasprintf+0x46/0x60
[ 104.067231] [<c1262ebd>] kobject_add_varg+0x2d/0x50
[ 104.072450] [<c1262f07>] kobject_init_and_add+0x27/0x30
[ 104.078075] [<c1089240>] mod_sysfs_setup+0x80/0x540
[ 104.083207] [<c1260851>] ? module_bug_finalize+0x51/0xc0
[ 104.088720] [<c108ab29>] load_module+0x1429/0x18b0
We can teardown sysfs first, then to be sure, put the state in
MODULE_STATE_UNFORMED so it's ignored while we deconstruct it.
Reported-by: Veaceslav Falico <vfalico@redhat.com>
Tested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When userspace sends messages to the audit system it includes a type.
We want to be able to filter messages based on that type without have to
do the all or nothing option currently available on the
AUDIT_FILTER_TYPE filter list. Instead we should be able to use the
AUDIT_FILTER_USER filter list and just use the message type as one part
of the matching decision.
Signed-off-by: Eric Paris <eparis@redhat.com>
Looks like this one has been around since 5195d8e21:
kernel/auditsc.c: In function ‘audit_free_names’:
kernel/auditsc.c:998: error: ‘i’ undeclared (first use in this function)
...and this warning:
kernel/auditsc.c: In function ‘audit_putname’:
kernel/auditsc.c:2045: warning: ‘i’ may be used uninitialized in this function
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Pull uprobes updates from Oleg Nesterov:
- "uretprobes" - an optimization to uprobes, like kretprobes are an optimization
to kprobes. "perf probe -x file sym%return" now works like kretprobes.
- PowerPC fixes plus a couple of cleanups/optimizations in uprobes and trace_uprobes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
not necessarily turn the scheduler-clock tick back on. This state of
affairs could result in RCU waiting on an adaptive-ticks CPU running
for an extended period in kernel mode. Such a CPU will never run the
RCU state machine, and could therefore indefinitely extend the RCU state
machine, sooner or later resulting in an OOM condition.
This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
causes RCU's force-quiescent-state processing to check for this condition
and to send an IPI to CPUs that remain in that state for too long.
"Too long" currently means about three jiffies by default, which is
quite some time for a CPU to remain in the kernel without blocking.
The rcu_tree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
sysfs variables may be used to tune "too long" if needed.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Remove the "single task" statement from CONFIG_NO_HZ_FULL
title. The constraint can be invalidated when tasks from
other sched classes than SCHED_FAIR are running. Moreover
it's possible that hrtick join the party in the future.
Also add a line about the dependency on SMP.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Rename CONFIG_PERIODIC_HZ to CONFIG_HZ_PERIODIC in
order to stay consistent with other tick implementation
entries:
CONFIG_HZ_PERIODIC
CONFIG_NO_HZ_IDLE
CONFIG_NO_HZ_FULL
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
"Extended nohz" was used as a naming base for the full dynticks
API and Kconfig symbols. It reflects the fact the system tries
to stop the tick in more places than just idle.
But that "extended" name is a bit opaque and vague. Rename it to
"full" makes it clearer what the system tries to do under this
config: try to shutdown the tick anytime it can. The various
constraints that prevent that to happen shouldn't be considered
as fundamental properties of this feature but rather technical
issues that may be solved in the future.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
In order to enforce backward compatibility with older
config files, we want the new dynticks-idle Kconfig entry
to default its value to the one of the old CONFIG_NO_HZ symbol
if present.
Namely we want:
config NO_HZ # old obsolete dynticks idle symbol
bool
config NO_HZ_IDLE # new dynticks idle symbol
default NO_HZ
However Kconfig prevents this to work if the old symbol
is not visible. And this is currently the case because
NO_HZ lacks a title in order to show it in make oldconfig
and alike.
To fix this, bring a minimal title and help text to the
obsolete Kconfig entry that explains its purpose. This
makes the "defaulting" to work.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change uprobe_perf_print()
to return if hlist_empty(call->perf_events).
Note: this is not uprobe-specific, we can change other users too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Pull {timer,irq,core} fixes from Thomas Gleixner:
- timer: bug fix for a cpu hotplug race.
- irq: single bugfix for a wrong return value, which prevents the
calling function to invoke the software fallback.
- core: bugfix which plugs two race confitions which can cause hotplug
per cpu threads to end up on the wrong cpu.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Don't reinitialize a cpu_base lock on CPU_UP
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: gic: fix irq_trigger return
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kthread: Prevent unpark race which puts threads on the wrong cpu
Now that we do sort the __extable at build time, we actually are
interested only in the case where we still do need to sort it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/1366023109-12098-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Trinity discovered that we fail to check all 64 bits of
attr.config passed by user space, resulting to out-of-bounds
access of the perf_swevent_enabled array in
sw_perf_event_destroy().
Introduced in commit b0a873ebb ("perf: Register PMU
implementations").
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's not used, and it can be retrieved via cgrp->root->top_cgroup.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We don't export any symbols > 128 characters, but if we did then
kallsyms_expand_symbol() would overflow the buffer handed to it.
So we need check destination buffer length when copying.
the related test:
if we define an EXPORT function which name more than 128.
will panic when call kallsyms_lookup_name by init_kprobes on booting.
after check the length (provide this patch), it is ok.
Implementaion:
add additional destination buffer length parameter (maxlen)
if uncompressed string is too long (>= maxlen), it will be truncated.
not check the parameters whether valid, since it is a static function.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility. This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors. As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default. Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed. Also,
cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
This is currently pretty crazy. If one mounts a cgroup, creates a
subdirectory, unmounts it and then mount it again with different
option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
While controllers shouldn't be accessing cgroupfs_root directly, it
being hidden inside kern/cgroup.c makes somethings pretty silly. This
makes routing hierarchy-wide settings which need to be visible to
controllers cumbersome.
We're gonna add another hierarchy-wide setting which needs to be
accessed from controllers. Move cgroupfs_root and its flags to the
header file so that we can access root settings with inline helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
There's no reason to be using bitops, which tends to be more
cumbersome, to handle root flags. Convert them to masks. Also, as
they'll be moved to include/linux/cgroup.h and it's generally a good
idea, add CGRP_ prefix.
Note that flags are assigned from (1 << 1). The first bit will be
used by a flag which will be added soon.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Changing uid/gid/projid mappings doesn't change your id within the
namespace; it reconfigures the namespace. Unprivileged programs should
*not* be able to write these files. (We're also checking the privileges
on the wrong task.)
Given the write-once nature of these files and the other security
checks, this is likely impossible to usefully exploit.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
When we require privilege for setting /proc/<pid>/uid_map or
/proc/<pid>/gid_map no longer allow an unprivileged user to
open the file and pass it to a privileged program to write
to the file.
Instead when privilege is required require both the opener and the
writer to have the necessary capabilities.
I have tested this code and verified that setting /proc/<pid>/uid_map
fails when an unprivileged user opens the file and a privielged user
attempts to set the mapping, that unprivileged users can still map
their own id, and that a privileged users can still setup an arbitrary
mapping.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Pull scheduler fixes from Ingo Molnar:
"Misc fixlets"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cputime: Fix accounting on multi-threaded processes
sched/debug: Fix sd->*_idx limit range avoiding overflow
sched_clock: Prevent 64bit inatomicity on 32bit systems
sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s
doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid
Luckily, this can only be done by root, but still is a nasty bug.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRaK2+AAoJEOdOSU1xswtMw48IAJPcSNMl1+epx5cPw8pwf+y6
YYvs/Ud3BMPBL+mpNPGNFWY+dWJsAtCtAgkLi0WgdL+b9iPNZrmQqqcP5xWV4uKV
vRX2SPCQcyEn5keNnFdN3fN1R0+Gj4V8kLvxPqugzNrO9EHejx+TJFWjrONzkcSy
g90lY45jfGWW0OS4GuSwHFhKDgcx8/kgb4Whv+xrKzTuX2QkU1BhG9WPsjiHWiL5
WRYjC4LWafrWaPd4cIkzMqj1eU/hL8BkiLLQHM1Tw8yD7t8OPzgmuJMZEh6Cx1iW
/Xrm5QkNEcqQ/vSAC6aWUi22VEgRYDLg8WjngwuMgY1Qa3LE2ex8cUDyk7lJbas=
=SFA8
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fixes from Steven Rostedt:
"Namhyung Kim found and fixed a bug that can crash the kernel by simply
doing: echo 1234 | tee -a /sys/kernel/debug/tracing/set_ftrace_pid
Luckily, this can only be done by root, but still is a nasty bug."
* tag 'trace-fixes-v3.9-rc-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section
tracing: Fix possible NULL pointer dereferences
While reimplementing cgroup_path(), 65dff759d2 ("cgroup: fix
cgroup_path() vs rename() race") introduced a bug where the path of a
non-root cgroup would have two slahses at the beginning, which is
caused by treating the root cgroup which has the name '/' like
non-root cgroups.
$ grep systemd /proc/self/cgroup
1:name=systemd://user/root/1
Fix it by special casing root cgroup case and not looping over it in
the normal path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Nothing is using it yet, but this will allow us to delay the open-time
checks to use time, without breaking the normal UNIX permission
semantics where permissions are determined by the opener (and the file
descriptor can then be passed to a different process, or the process can
drop capabilities).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uprobe_perf_print() passes addr=ip to perf_trace_buf_submit() for
no reason. This sets perf_sample_data->addr for PERF_SAMPLE_ADDR,
we already have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Finally change create_trace_uprobe() to check if argv[0][0] == 'r'
and pass the correct "is_ret" to alloc_trace_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change probes_seq_show() and print_uprobe_event() to check
is_ret_probe() and print the correct data.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change uprobe_event_define_fields(), and __set_print_fmt() to check
is_ret_probe() and use the appropriate format/fields.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change uprobe_trace_print() and uprobe_perf_print() to check
is_ret_probe() and fill ring_buffer_event accordingly.
Also change uprobe_trace_func() and uprobe_perf_func() to not
_print() if is_ret_probe() is true. Note that we keep ->handler()
nontrivial even for uretprobe, we need this for filtering and for
other potential extensions.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Create the new functions we need to support uretprobes, and change
alloc_trace_uprobe() to initialize consumer.ret_handler if the new
"is_ret" argument is true. Curently this argument is always false,
so the new code is never called and is_ret_probe(tu) is false too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Extract the output code from uprobe_trace_func() and uprobe_perf_func()
into the new helpers, they will be used by ->ret_handler() too. We also
add the unused "unsigned long func" argument in advance, to simplify the
next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
struct uprobe_trace_entry_head has a single member for reporting,
"unsigned long ip". If we want to support uretprobes we need to
create another struct which has "func" and "ret_ip" and duplicate
a lot of functions, like trace_kprobe.c does.
To avoid this copy-and-paste horror we turn ->ip into ->vaddr[]
and add couple of trivial helpers to calculate sizeof/data. This
uglifies the code a bit, but this allows us to avoid a lot more
complications later, when we add the support for ret-probes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
uprobe_trace_func() is never called with irqs or preemption
disabled, no need to ask preempt_count() or local_save_flags().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Anton Arapov <anton@redhat.com>
seq_print_ip_sym(ip) in print_uprobe_event() is pointless,
kallsyms_lookup(ip) can not resolve a user-space address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
uprobe_trace_func() and uprobe_perf_func() do not need task_pt_regs(),
we already have "struct pt_regs *regs".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Unlike the kretprobes we can't trust userspace, thus must have
protection from user space attacks. User-space have "unlimited"
stack, and this patch limits the return probes nestedness as a
simple remedy for it.
Note that this implementation leaks return_instance on siglongjmp
until exit()/exec().
The intention is to have KISS and bare minimum solution for the
initial implementation in order to not complicate the uretprobes
code.
In the future we may come up with more sophisticated solution that
remove this depth limitation. It is not easy task and lays beyond
this patchset.
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Uretprobe handlers are invoked when the trampoline is hit, on completion
the trampoline is replaced with the saved return address and the uretprobe
instance deleted.
TODO: handle_trampoline() assumes that ->return_instances is always valid.
We should teach it to handle longjmp() which can invalidate the pending
return_instance's. This is nontrivial, we will try to do this in a separate
series.
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
When a uprobe with return probe consumer is hit, prepare_uretprobe()
function is invoked. It creates return_instance, hijacks return address
and replaces it with the trampoline.
* Return instances are kept as stack per uprobed task.
* Return instance is chained, when the original return address is
trampoline's page vaddr (e.g. recursive call of the probed function).
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Allocate trampoline page, as the very first one in uprobed
task xol area, and fill it with breakpoint opcode.
Also introduce get_trampoline_vaddr() helper, to wrap the
trampoline address extraction from area->vaddr. That removes
confusion and eases the debug experience in case ->vaddr
notion will be changed.
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
It seems that function profiler's hash size is fixed at 1024. Add and
use FTRACE_PROFILE_HASH_BITS instead and update hash size macro.
Link: http://lkml.kernel.org/r/1365551750-4504-1-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_count can be decreased with a "!" pattern, so that
the enabled flag should be updated too.
Link: http://lkml.kernel.org/r/1365663698-2413-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to
be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the
ftrace_pid_fops is defined when DYNAMIC_FTRACE is not.
Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently set_ftrace_pid and set_graph_function files use seq_lseek
for their fops. However seq_open() is called only for FMODE_READ in
the fops->open() so that if an user tries to seek one of those file
when she open it for writing, it sees NULL seq_file and then panic.
It can be easily reproduced with following command:
$ cd /sys/kernel/debug/tracing
$ echo 1234 | sudo tee -a set_ftrace_pid
In this example, GNU coreutils' tee opens the file with fopen(, "a")
and then the fopen() internally calls lseek().
Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reverts commit 84cfb6ab48. There
are scheduled changes which make use of the removed callback.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rami Rosen <ramirose@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
audit_enabled has already been exported in
include/linux/audit.h. and kernel/audit.h
includes include/linux/audit.h, no need to
export aduit_enabled again in kernel/audit.h
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:
CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)
We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.
The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.
Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.
Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.
The migration thread has another related issue.
CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- System reboot/halt fix related to CPU offline ordering
from Huacai Chen.
- intel_pstate driver fix for a delay time computation error
occasionally crashing systems using it from Dirk Brandewie.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJRZynOAAoJEKhOf7ml8uNsYWcQAIZIps7Ivn2+r3ENL+jhTohx
ErEz/cu/YIS/TnDzO3GO+Yo9CcXjUebMWqefIC//YK/K+tNepVOLovthTGiA/X36
23RDRrF1hqZlEgiEfFpuXiyq9u33CbUCYt75tsBXhxJkxeG7J7JfiG4AUh8dED4B
nUCbQ4jWM7r9DYJFl2gjDkFt1SjG/UbxcN9Kua9v4zfJil9fKp9093HHYBHH3a2n
zXlAE7CskXrNOepwp9Efzu5uPU3gbkIiQdKxvUs91remAcZ3fMsbz8CerZlgfy1S
+3f4AuU9i2AXeYI5fanhLo6Mwm8jqBvZ8ZE4Fh/EuQs9eHk7VuRsy7n22zaVeU0A
efaldd/pdP7KbSv5Wrs8adQr3GcRHkuHnMGhTlp41tfR8gJfpZUrK3/6h/jnIPRC
1UnBAF4K67v85fBO6gnC8UhEp3MXXXZoPtPByGILxj34KVn+oHzrVgE+8+ugv7HM
ZJ5jobYPWrxI2lZv5kuBdHCVg2TAC3YUz2aev8cEhIo4vdcIC2cofVDyAcN9ArqF
aF6fcNr6Rgu/M6bB2bP/zbhmDApr8H8z952jss51gprJ+IiKNUh9daiFnYw+o391
9VVTolC7k6P4pXTbtgqEFDLTJ0dKD8i/J4RLHwIsX7jzVgLctyqKZsNXskovjH4p
jqIxu/1SPxR2dtBziUEH
=bkFT
-----END PGP SIGNATURE-----
Merge tag 'pm-3.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
- System reboot/halt fix related to CPU offline ordering from Huacai
Chen.
- intel_pstate driver fix for a delay time computation error
occasionally crashing systems using it from Dirk Brandewie.
* tag 'pm-3.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()
cpufreq / intel_pstate: Set timer timeout correctly
RHBZ: 785936
If the audit system collects a record about one process sending a signal
to another process it includes in that collection the 'secid' or 'an int
used to represet an LSM label.' If there is no LSM enabled it will
collect a 0. The problem is that when we attempt to print that record
we ask the LSM to convert the secid back to a string. Since there is no
LSM it returns EOPNOTSUPP.
Most code in the audit system checks if the secid is 0 and does not
print LSM info in that case. The signal information code however forgot
that check. Thus users will see a message in syslog indicating that
converting the sid to string failed. Add the right check.
Signed-off-by: Eric Paris <eparis@redhat.com>
Userspace parsing libraries assume that msg= is only for userspace audit
records, not for user tty records. Make this consistent with the other
tty records.
Reported-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
When compiling kernel with -jN (N > 1), all warning/error messages
printed while openssl is generating key pair may get mixed dots and
other symbols openssl sends to stderr. This patch makes sure openssl
logs go to default stdout.
Example of the garbage on stderr:
crypto/anubis.c:581: warning: ‘inter’ is used uninitialized in this function
Generating a 4096 bit RSA private key
.........
drivers/gpu/drm/i915/i915_gem_gtt.c: In function ‘gen6_ggtt_insert_entries’:
drivers/gpu/drm/i915/i915_gem_gtt.c:440: warning: ‘addr’ may be used uninitialized in this function
.net/mac80211/tx.c: In function ‘ieee80211_subif_start_xmit’:
net/mac80211/tx.c:1780: warning: ‘chanctx_conf’ may be used uninitialized in this function
..drivers/isdn/hardware/mISDN/hfcpci.c: In function ‘hfcpci_softirq’:
.....drivers/isdn/hardware/mISDN/hfcpci.c:2298: warning: ignoring return value of ‘driver_for_each_device’, declared with attribute warn_unused_result
Signed-off-by: David Cohen <david.a.cohen@intel.com>
Reviewed-by: mark gross <mark.gross@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If the function profiler fails to allocate memory for everything,
it will do a double free on the same pointer which can cause a panic.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRZdpLAAoJEOdOSU1xswtMyKgH/12ep1nFAYvXQQ04vcV3stCV
7vgk6oDMAGSYgwV2eNUbHNm2zkQBifFxUWLqWyzCd9t4RZUiIv5QHd2a+N2Ta+Xp
Do8zhwod3vzSaZsM3JvQRK5q8U6R72dqroPiv+lJ+jh7cIPdHCm87P+ZPYgAgpfv
6J80Vk34q/HdEGEmNuQgLzgfB+sfld/Ob6Te69f1rmzqCfHCytY1i3R0iPWvaI/v
B8R5cosjDhm0hAljsFlZb2Vl1jb89ByTgX3dL5Ph3O+hnHPCWE+ZQtbLCaOBV9F0
z8glXmAu2XVhv++0d21ul/TddQhVYQYF+ZMawxUlnLVKZ/J66c3l9Omhwf33Wz4=
=+NqN
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.9-rc-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Namhyung Kim fixed a long standing bug that can cause a kernel panic.
If the function profiler fails to allocate memory for everything, it
will do a double free on the same pointer which can cause a panic"
* tag 'trace-fixes-3.9-rc-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix double free when function profile init failed
> In function audit_alloc_context(), use kzalloc, instead of kmalloc+memset. Patch also renames audit_zero_context() to
> audit_set_context(), to represent it's inner workings properly.
Fair enough. I'd go futher...
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
In function audit_alloc_context(), use kzalloc, instead of kmalloc+memset. Patch also renames audit_zero_context() to
audit_set_context(), to represent it's inner workings properly.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
perf_event is one of a couple remaining cgroup controllers with broken
hierarchy support. Converting it to support hierarchy is almost
trivial. The only thing necessary is to consider a task belonging to
a descendant cgroup as a match. IOW, if the cgroup of the currently
executing task (@cpuctx->cgrp) equals or is a descendant of the
event's cgroup (@event->cgrp), then the event should be enabled.
Implement hierarchy support and remove .broken_hierarchy tag along
with the incorrect comment on what needs to be done for hierarchy
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
A couple controllers want to determine whether two cgroups are in
ancestor/descendant relationship. As it's more likely that the
descendant is the primary subject of interest and there are other
operations focusing on the descendants, let's ask is_descendent rather
than is_ancestor.
Implementation is trivial as the previous patch guarantees that all
ancestors of a cgroup stay accessible as long as the cgroup is
accessible.
tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
rewrote descriptions.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
be freed. Then we rmdir the parent cgroup, and the parent is freed
immediately due to css ref draining to 0. Now it would be a disaster if
the still-alive child cgroup tries to access its parent.
Make sure this won't happen.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The bind() method of cgroup_subsys is not used in any of the
controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
devices, perf, hugetlb, cpu and cpuacct)
tj: Removed the entry on ->bind() from
Documentation/cgroups/cgroups.txt. Also updated a couple
paragraphs which were suggesting that dynamic re-binding may be
implemented. It's not gonna.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
__audit_socketcall is an extern function.
better to check its parameters by itself.
also can return error code, when fail (find invalid parameters).
also use macro instead of real hard code number
also give related comments for it.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
[eparis: fix the return value when !CONFIG_AUDIT]
Signed-off-by: Eric Paris <eparis@redhat.com>
filename should be destroyed via final_putname() instead of __putname()
Otherwise this result in following BUGON() in case of long names:
kernel BUG at mm/slab.c:3006!
Call Trace:
kmem_cache_free+0x1c1/0x850
audit_putname+0x88/0x90
putname+0x73/0x80
sys_symlinkat+0x120/0x150
sys_symlink+0x16/0x20
system_call_fastpath+0x16/0x1b
Introduced-in: 7950e3852
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The cpuacct split caused this build failure on UML:
kernel/sched/cpuacct.c:94:2: error: implicit declaration of function 'ERR_PTR'
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only user was cpuacct.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155385A.4040207@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now we're guaranteed when cpuacct_charge() and
cpuacct_account_field() are called, cpuacct has already been
properly initialized, so we no longer need those checks.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155384C.7000508@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initialize cpuacct before the scheduler is functioning, so when
cpuacct_charge() and cpuacct_account_field() are called,
task_ca() won't return NULL.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155383F.8000005@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now we don't need cpuacct_init(), and instead we just initialize
root_cpuacct when it's defined.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553834.9090701@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a preparation, so later we can initialize cpuacct
earlier.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553822.5000403@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now most of the code in cpuacct.h can be moved to cpuacct.c
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/515536D5.2080401@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a micro optimazation for a hot path.
- We don't need to check if @ca returned from task_ca() is NULL.
- We don't need to check if @ca returned from parent_ca() is NULL.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/515536B7.6060602@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a micro optimization for the hot path.
- We don't need to check if @ca is NULL in parent_ca().
- We don't need to check if @ca is NULL in the beginning of the for loop.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/515536A9.5000700@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So we can remove open-coded cpuacct code in cputime.c.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553692.9060008@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So we don't open-coded initialization of cpuacct in core.c.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51553687.1060906@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add cpuacct.h and let sched.h include it.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155367B.2060506@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A comment in function rebalance_domains() mentions
arch_init_sched_domains(), but that function does not exist
anymore. The proper function is init_sched_domains().
Signed-off-by: Libin <huawei.libin@huawei.com>
Cc: <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1364814841-49156-1-git-send-email-huawei.libin@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
* Remove restrictions on no-CBs CPUs, make RCU_FAST_NO_HZ
take advantage of numbered callbacks, do additional callback
accelerations based on numbered callbacks. Posted to LKML
at https://lkml.org/lkml/2013/3/18/960.
* RCU documentation updates. Posted to LKML at
https://lkml.org/lkml/2013/3/18/570.
* Miscellaneous fixes. Posted to LKML at
https://lkml.org/lkml/2013/3/18/594.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At this point tsk_cache_hot is always true, so no need to check it.
Signed-off-by: Zhang Hang <bob.zhanghang@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51650107.9040606@huawei.com
[ Also remove unnecessary schedstat #ifdefs. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On the failure path, stat->start and stat->pages will refer same page.
So it'll attempt to free the same page again and get kernel panic.
Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use strlcpy() instead of strncpy() as it will always add a '\0'
to the end of the string even if the buffer is smaller than what
is being copied.
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
fixes a long time minor bug.
The first patch fixes a race that can happen if the user switches
from the irqsoff tracer to another tracer. If a irqs off latency is
detected, it will try to use the snapshot buffer, but the new tracer
wont have it allocated. There's a nasty warning that gets printed and
the trace is ignored. Nothing crashes, just a nasty WARN_ON is shown.
The second patch fixes an issue where if the sysctl is used to disable
and enable function tracing, it can put the function tracing into an
unstable state.
The third patch fixes an issue with perf using the function tracer.
An update was done, where the stub function could be called during
the perf function tracing, and that stub function wont have the
"control" flag set and cause a nasty warning when running perf.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRYyyXAAoJEOdOSU1xswtMMtQH/0Ks494IyC9zAcSFZXJGagc2
bV1k2WrHUuXZnDEP3DIrwS87YwYOYD6l/7TW7AUc2AsFIgwsQ8tP+riI2FZVduAs
LLKR3NxE8B8hi+QS7fbEXea6jcRX2I+gnsv8bLenDVbliCWs1wZbSo8jbyOFjpKa
AWRpjIIBmKYB/dGn87YVOLAYHiMUO5WScKwJV0bCL9m5r2/7a1nu1j8KiQ9N0Vun
43jimIHYDlI/eSOGNIJPFAc/zjPXlPDFrpGcPg6wgUDfwSO0Cbz2PM46uxen+s91
Z4mbiqEONSTcl/wKYx9s6zRY+brkvP3AK0d1x1Al+TkTeFeaVPkTwmKSI/e46ow=
=9Ide
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This includes three fixes. Two fix features added in 3.9 and one
fixes a long time minor bug.
The first patch fixes a race that can happen if the user switches from
the irqsoff tracer to another tracer. If a irqs off latency is
detected, it will try to use the snapshot buffer, but the new tracer
wont have it allocated. There's a nasty warning that gets printed and
the trace is ignored. Nothing crashes, just a nasty WARN_ON is shown.
The second patch fixes an issue where if the sysctl is used to disable
and enable function tracing, it can put the function tracing into an
unstable state.
The third patch fixes an issue with perf using the function tracer.
An update was done, where the stub function could be called during the
perf function tracing, and that stub function wont have the "control"
flag set and cause a nasty warning when running perf."
* tag 'trace-fixes-3.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Do not call stub functions in control loop
ftrace: Consistently restore trace function on sysctl enabling
tracing: Fix race with update_max_tr_single and changing tracers
One can trigger an overflow when using ktime_add_ns() on a 32bit
architecture not supporting CONFIG_KTIME_SCALAR.
When passing a very high value for u64 nsec, e.g. 7881299347898368000
the do_div() function converts this value to seconds (7881299347) which
is still to high to pass to the ktime_set() function as long. The result
in is a negative value.
The problem on my system occurs in the tick-sched.c,
tick_nohz_stop_sched_tick() when time_delta is set to
timekeeping_max_deferment(). The check for time_delta < KTIME_MAX is
valid, thus ktime_add_ns() is called with a too large value resulting in
a negative expire value. This leads to an endless loop in the ticker code:
time_delta: 7881299347898368000
expires = ktime_add_ns(last_update, time_delta)
expires: negative value
This fix caps the value to KTIME_MAX.
This error doesn't occurs on 64bit or architectures supporting
CONFIG_KTIME_SCALAR (e.g. ARM, x86-32).
Cc: stable@vger.kernel.org
Signed-off-by: David Engraf <david.engraf@sysgo.com>
[jstultz: Minor tweaks to commit message & header]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The settimeofday01 test in the LTP testsuite effectively does
gettimeofday(current time);
settimeofday(Jan 1, 1970 + 100 seconds);
settimeofday(current time);
This test causes a stack trace to be displayed on the console during the
setting of timeofday to Jan 1, 1970 + 100 seconds:
[ 131.066751] ------------[ cut here ]------------
[ 131.096448] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x135/0x140()
[ 131.104935] Hardware name: Dinar
[ 131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables kvm_amd kvm sp5100_tco bnx2 i2c_piix4 crc32c_intel k10temp fam15h_power ghash_clmulni_intel amd64_edac_mod pcspkr serio_raw edac_mce_amd edac_core microcode xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc_t10dif pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ahci pata_atiixp libahci libata usb_storage i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6
[ 131.182248] Call Trace:
[ 131.184684] <IRQ> [<ffffffff810612af>] warn_slowpath_common+0x7f/0xc0
[ 131.191312] [<ffffffff8106130a>] warn_slowpath_null+0x1a/0x20
[ 131.197131] [<ffffffff810b9fd5>] clockevents_program_event+0x135/0x140
[ 131.203721] [<ffffffff810bb584>] tick_program_event+0x24/0x30
[ 131.209534] [<ffffffff81089ab1>] hrtimer_interrupt+0x131/0x230
[ 131.215437] [<ffffffff814b9600>] ? cpufreq_p4_target+0x130/0x130
[ 131.221509] [<ffffffff81619119>] smp_apic_timer_interrupt+0x69/0x99
[ 131.227839] [<ffffffff8161805d>] apic_timer_interrupt+0x6d/0x80
[ 131.233816] <EOI> [<ffffffff81099745>] ? sched_clock_cpu+0xc5/0x120
[ 131.240267] [<ffffffff814b9ff0>] ? cpuidle_wrap_enter+0x50/0xa0
[ 131.246252] [<ffffffff814b9fe9>] ? cpuidle_wrap_enter+0x49/0xa0
[ 131.252238] [<ffffffff814ba050>] cpuidle_enter_tk+0x10/0x20
[ 131.257877] [<ffffffff814b9c89>] cpuidle_idle_call+0xa9/0x260
[ 131.263692] [<ffffffff8101c42f>] cpu_idle+0xaf/0x120
[ 131.268727] [<ffffffff815f8971>] start_secondary+0x255/0x257
[ 131.274449] ---[ end trace 1151a50552231615 ]---
When we change the system time to a low value like this, the value of
timekeeper->offs_real will be a negative value.
It seems that the WARN occurs because an hrtimer has been started in the time
between the releasing of the timekeeper lock and the IPI call (via a call to
on_each_cpu) in clock_was_set() in the do_settimeofday() code. The end result
is that a REALTIME_CLOCK timer has been added with softexpires = expires =
KTIME_MAX. The hrtimer_interrupt() fires/is called and the loop at
kernel/hrtimer.c:1289 is executed. In this loop the code subtracts the
clock base's offset (which was set to timekeeper->offs_real in
do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which
was KTIME_MAX):
KTIME_MAX - (a negative value) = overflow
A simple check for an overflow can resolve this problem. Using KTIME_MAX
instead of the overflow value will result in the hrtimer function being run,
and the reprogramming of the timer after that.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
[jstultz: Tweaked commit subject]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The kauditd_thread() task was started only after the auditd userspace daemon
registers itself with kaudit. This was fine when only auditd consumed messages
from the kaudit netlink unicast socket. With the addition of a multicast group
to that socket it is more convenient to have the thread start on init of the
kaudit kernel subsystem.
Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The wait queue control code in kauditd_thread() was nested deeper than
necessary. The function has been flattened for better legibility.
Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The hold queue flush code is an autonomous chunk of code that can be
refactored, removed from kauditd_thread() into flush_hold_queue() and
flattenned for better legibility.
Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
It is useful to extend GID/EGID comparation logic to be able to
match not only the exact EID/EGID values but the group/egroup also.
Signed-off-by: Matvejchikov Ilya <matvejchikov@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
subsystems PM) say, syscore_ops operations should be carried with one
CPU on-line and interrupts disabled. However, after commit f96972f2d
(kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
syscore_shutdown() is called before disable_nonboot_cpus(), so break
the rules. We have a MIPS machine with a 8259A PIC, and there is an
external timer (HPET) linked at 8259A. Since 8259A has been shutdown
too early (by syscore_shutdown()), disable_nonboot_cpus() runs without
timer interrupt, so it hangs and reboot fails. This patch call
syscore_shutdown() a little later (after disable_nonboot_cpus()) to
avoid reboot failure, this is the same way as poweroff does.
For consistency, add disable_nonboot_cpus() to kernel_halt().
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function tracing control loop used by perf spits out a warning
if the called function is not a control function. This is because
the control function references a per cpu allocated data structure
on struct ftrace_ops that is not allocated for other types of
functions.
commit 0a016409e4 "ftrace: Optimize the function tracer list loop"
Had an optimization done to all function tracing loops to optimize
for a single registered ops. Unfortunately, this allows for a slight
race when tracing starts or ends, where the stub function might be
called after the current registered ops is removed. In this case we
get the following dump:
root# perf stat -e ftrace:function sleep 1
[ 74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()
[ 74.349522] Hardware name: PRIMERGY RX200 S6
[ 74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk
r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li
bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 74.446233] Pid: 1377, comm: perf Tainted: G W 3.9.0-rc1 #1
[ 74.453458] Call Trace:
[ 74.456233] [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0
[ 74.462997] [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0
[ 74.470272] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[ 74.478117] [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20
[ 74.484681] [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0
[ 74.491760] [<ffffffff8162f400>] ftrace_call+0x5/0x2f
[ 74.497511] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[ 74.503486] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[ 74.509500] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[ 74.516088] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.522268] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[ 74.528837] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[ 74.536696] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.542878] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[ 74.548869] [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50
[ 74.556243] [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140
[ 74.563709] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.569887] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[ 74.575898] [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50
[ 74.582505] [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10
[ 74.589298] [<ffffffff811295d0>] free_event+0x70/0x1a0
[ 74.595208] [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0
[ 74.602460] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.608667] [<ffffffff8112a640>] put_event+0x90/0xc0
[ 74.614373] [<ffffffff8112a740>] perf_release+0x10/0x20
[ 74.620367] [<ffffffff811a3044>] __fput+0xf4/0x280
[ 74.625894] [<ffffffff811a31de>] ____fput+0xe/0x10
[ 74.631387] [<ffffffff81083697>] task_work_run+0xa7/0xe0
[ 74.637452] [<ffffffff81014981>] do_notify_resume+0x71/0xb0
[ 74.643843] [<ffffffff8162fa92>] int_signal+0x12/0x17
To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end
ftrace_ops stub as just that, a stub. This flag is now checked in the
control loop and the function is not called if the flag is set.
Thanks to Jovi for not just reporting the bug, but also pointing out
where the bug was in the code.
Link: http://lkml.kernel.org/r/514A8855.7090402@redhat.com
Link: http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhangwei@huawei.com
Tested-by: WANG Chao <chaowang@redhat.com>
Reported-by: WANG Chao <chaowang@redhat.com>
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we reenable ftrace via syctl, we currently set ftrace_trace_function
based on the previous simplistic algorithm. This is inconsistent with
what update_ftrace_function does. So better call that helper instead.
Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit 34600f0e9 "tracing: Fix race with max_tr and changing tracers"
fixed the updating of the main buffers with the race of changing
tracers, but left out the fix to the updating of just a per cpu buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Recent commit 6fac4829 ("cputime: Use accessors to read task
cputime stats") introduced a bug, where we account many times
the cputime of the first thread, instead of cputimes of all
the different threads.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130404085740.GA2495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All idle functions in arch/* are more or less the same, plus minus a
few bugs and extra instrumentation, tickless support and other
optional items.
Implement a generic idle function which resembles the functionality
found in arch/. Provide weak arch_cpu_idle_* functions which can be
overridden by the architecture code if needed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.646635455@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For now this calls cpu_idle(), but in the long run we want to move the
cpu bringup code to the core and therefor we add a state argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move it to a common place. Preparatory patch for implementing
set/clear for the idle need_resched poll implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.446034505@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For NUL terminated string we always need to set '\0' at the end.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/516243B7.9020405@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For NUL terminated string we always need to set '\0' at the end.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For NUL terminated string, always make sure that there's '\0' at the end.
In our case we need a return value, so still use strncpy() and
fix up the tail explicitly.
(strlcpy() returns the size, not the pointer)
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org <paulus@samba.org>
Cc: acme@ghostprotocols.net <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/51623E0B.7070101@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 201c373e8e ("sched/debug: Limit sd->*_idx range on
sysctl") was an incomplete bug fix.
This patch fixes sd->*_idx limit range to [0 ~ CPU_LOAD_IDX_MAX-1]
avoiding array overflow caused by setting sd->*_idx to CPU_LOAD_IDX_MAX
on sysctl.
Signed-off-by: Libin <huawei.libin@huawei.com>
Cc: <jiang.liu@huawei.com>
Cc: <guohanjun@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51626610.2040607@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.
CPU0 CPU1
sched_clock_local() sched_clock_remote(CPU0)
...
remote_clock = scd[CPU0]->clock
read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
read_high32bit(scd[CPU0]->clock)
While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.
It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.
The resulting misbehaviour is, that CPU1 will see the sched_clock on
CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.
The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.
The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:
<idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
<idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0
<idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_remote_clock(). The time jump gets propagated to CPU0 via
sched_remote_clock() and stays stale on both cores for ~4 seconds.
There are only two other possibilities, which could cause a stale
sched clock:
1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
wrong value.
2) sched_clock() which reads the TSC returns a sporadic wrong value.
#1 can be excluded because sched_clock would continue to increase for
one jiffy and then go stale.
#2 can be excluded because it would not make the clock jump
forward. It would just result in a stale sched_clock for one jiffy.
After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.
So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.
Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
We don't want controllers to assume that the information is officially
available and do funky things with it.
The only user is task_subsys_state_check() which uses it to verify RCU
access context. We can move cgroup_lock_is_held() inside
CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
to conditionally exposing cgroup_mutex.
Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
and use lockdep_is_held() directly on the mutex in
task_subsys_state_check().
While at it, add parentheses around macro arguments in
task_subsys_state_check().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that locking interface is unexported, there's no reason to keep
around these thin wrappers. Kill them and use mutex operations
directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that all external cgroup_lock() users are gone, we can finally
unexport the locking interface and prevent future abuse of
cgroup_mutex.
Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also,
cgroup_attach_task() doesn't have any user left and can't be used
without locking interface anyway. Make it static too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be
made static. Relocate the former and cgroup_attach_task_all() so that
we don't need forward declarations.
This patch is pure relocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
When a cpuset becomes empty (no CPU or memory), its tasks are
transferred with the nearest ancestor with execution resources. This
is implemented using cgroup_scan_tasks() with a callback which grabs
cgroup_mutex and invokes cgroup_attach_task() on each task.
Both cgroup_mutex and cgroup_attach_task() are scheduled to be
unexported. Implement cgroup_transfer_tasks() in cgroup proper which
is essentially the same as move_member_tasks_to_cpuset() except that
it takes cgroups instead of cpusets and @to comes before @from like
normal functions with those arguments, and replace
move_member_tasks_to_cpuset() with it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
freeze state is a software suspend state that does not run into
low-level platform callbacks which may interact with BIOS.
And freeze state does not need to disable the processors.
But the current pm_test support misleads users because users
can enter freeze state with pm_test set to TEST_CPUS/TEST_CORE,
while this pm_test setting never takes actions.
So, invalidate TEST_CPUS/TEST_CORE for freeze state in this patch.
Then users will get an error instead, when trying to
enter freeze state with pm_test mode set to TEST_CPUS/TEST_CORE.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Invoke freeze_enter() after suspend_test(TEST_PLATFORM) being invoked.
So when setting /sys/power/pm_test to "platform", it can be used to
check if freeze state is working well after all devices are suspended
and before processors are blocked,
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel writes:
Highlights:
- Imre's for_each_sg_pages rework (now also with the stolen mem backed
case fixed with a hack) plus the drm prime sg list coalescing patch from
Rahul Sharma. I have some follow-up cleanups pending, already acked by
Andrew Morton.
- Some prep-work for the crazy no-pch/display-less platform by Ben.
- Some vlv patches, by far not all (Jesse et al).
- Clean up the HDMI/SDVO #define confusion (Paulo)
- gen2-4 vblank fixes from Ville.
- Unclaimed register warning fixes for hsw (Paulo). More still to come ...
- Complete pageflips which have been stuck in a gpu hang, should prevent
stuck gl compositors (Ville).
- pm patches for vt-switchless resume (Jesse). Note that the i915 enabling
is not (yet) included, that took a bit longer to settle. PM patches are
acked by Rafael Wysocki.
- Minor fixlets all over from various people.
* tag 'drm-intel-next-2013-03-23' of git://people.freedesktop.org/~danvet/drm-intel: (79 commits)
drm/i915: Implement WaSwitchSolVfFArbitrationPriority
drm/i915: Set the VIC in AVI infoframe for SDVO
drm/i915: Kill a strange comment about DPMS functions
drm/i915: Correct sandybrige overclocking
drm/i915: Introduce GEN7_FEATURES for device info
drm/i915: Move num_pipes to intel info
drm/i915: fixup pd vs pt confusion in gen6 ppgtt code
style nit: Align function parameter continuation properly.
drm/i915: VLV doesn't have HDMI on port C
drm/i915: DSPFW and BLC regs are in the display offset range
drm/i915: set conservative clock gating values on VLV v2
drm/i915: fix WaDisablePSDDualDispatchEnable on VLV v2
drm/i915: add more VLV IDs
drm/i915: use VLV DIP routines on VLV v2
drm/i915: add media well to VLV force wake routines v2
drm/i915: don't use plane pipe select on VLV
drm: modify pages_to_sg prime helper to create optimized SG table
drm/i915: use for_each_sg_page for setting up the gtt ptes
drm/i915: create compact dma scatter lists for gem objects
drm/i915: handle walking compact dma scatter lists
...
Shorten the seqcount write hold region to the actual update of the
timekeeper and the related data (e.g vsyscall).
On a contemporary x86 system this reduces the maximum latencies on
Preempt-RT from 8us to 4us on the non-timekeeping cores.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Use the shadow timekeeper to do the update_wall_time() adjustments and
then copy it over to the real timekeeper.
Keep the shadow timekeeper in sync when updating stuff outside of
update_wall_time().
This allows us to limit the timekeeper_seq hold time to the update of
the real timekeeper and the vsyscall data in the next patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
For calculating the new timekeeper values store the new cycle_last
value in the timekeeper and update the clock->cycle_last just when we
actually update the new values.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
For implementing a shadow timekeeper and a split calculation/update
region we need to store the cycle_last value in the timekeeper and
update the value in the clocksource struct only in the update region.
Add the extra storage to the timekeeper.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In order to properly handle the NTP state in future changes to the
timekeeping lock management, this patch moves the management of
all of the ntp state under the timekeeping locks.
This allows us to remove the ntp_lock.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since we are taking the timekeeping locks, just go ahead
and update any tai change directly, rather then dropping
the lock and calling a function that will just take it again.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In moving the NTP state to be protected by the timekeeping locks,
be sure to acquire the timekeeping locks prior to calling
ntp functions.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since ADJ_SETOFFSET adjusts the timekeeping state, process
it as part of the top level do_adjtimex() function in
timekeeping.c.
This avoids deadlocks that could occur once we change the
ntp locking rules.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In order to change the locking rules, we need to provide
the timespec and tai values rather then having the ntp
logic acquire these values itself.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Move logic that does not need the ntp state to be done
in the timekeeping do_adjtimex() call.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In preparation for changing the ntp locking rules, move
do_adjtimex and hardpps accessor functions to timekeeping.c,
but keep the code logic in ntp.c.
This patch also introduces a ntp_internal.h file so timekeeping
specific interfaces of ntp.c can be more limitedly shared with
timekeeping.c.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Split out the timex validation done in do_adjtimex into a separate
function. This will help simplify logic in following patches.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
destroy_workqueue() performs several sanity checks before proceeding
with destruction of a workqueue. One of the checks verifies that
refcnt of each pwq (pool_workqueue) is over 1 as at that point there
should be no in-flight work items and the only holder of pwq refs is
the workqueue itself.
This worked fine as a workqueue used to hold only one reference to its
pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity
for unbound workqueues"), a workqueue may hold multiple references to
its default pwq triggering this sanity check spuriously.
Fix it by not triggering the pwq->refcnt assertion on default pwqs.
An example spurious WARN trigger follows.
WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
Hardware name: 4286C12
Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
Call Trace:
[<c04314a7>] warn_slowpath_common+0x7c/0x93
[<c04314e0>] warn_slowpath_null+0x22/0x24
[<c044796a>] destroy_workqueue+0x6a/0x13e
[<c056dc01>] ext4_put_super+0x43/0x2c4
[<c04fb7b8>] generic_shutdown_super+0x4b/0xb9
[<c04fb848>] kill_block_super+0x22/0x60
[<c04fb960>] deactivate_locked_super+0x2f/0x56
[<c04fc41b>] deactivate_super+0x2e/0x31
[<c050f1e6>] mntput_no_expire+0x103/0x108
[<c050fdce>] sys_umount+0x2a2/0x2c4
[<c050fe0e>] sys_oldumount+0x1e/0x20
[<c085ba4d>] sysenter_do_call+0x12/0x38
tj: Rewrote description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Change write_opcode() to use copy_highpage() + copy_to_page()
and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Extract the kmap_atomic/memcpy/kunmap_atomic code from
xol_get_insn_slot() into the new simple helper, copy_to_page().
It will have more users soon.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
__copy_insn(filp) can only be called after valid_vma() returns T,
vma->vm_file passed as "filp" can not be NULL.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Change __copy_insn() to use copy_from_page() and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
No functional changes. Rename copy_opcode() into copy_from_page() and
add the new "int len" argument to make it more more generic for the
new users.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Some architectures like powerpc have multiple variants of the trap
instruction. Introduce an additional helper is_trap_insn() for run-time
handling of non-uprobe traps on such architectures.
While there, change is_swbp_at_addr() to is_trap_at_addr() for reading
clarity.
With this change, the uprobe registration path will supercede any trap
instruction inserted at the requested location, while taking care of
delivering the SIGTRAP for cases where the trap notification came in
for an address without a uprobe. See [1] for a more detailed explanation.
[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-March/104771.html
This change was suggested by Oleg Nesterov.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cleanup. Now that we have f_inode/file_inode() we can use it instead
of vm_file->f_mapping->host.
This should not make any difference for uprobes, but in theory this
change is more correct. We use this inode as a key, to compare it
with uprobe->inode set by uprobe_register(inode), and the caller uses
d_inode.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
This patch removes unused parameter from cgroup_task_migrate().
Signed-off-by: Kevin Wilson <wkevils@gmail.com>
Acked-by: Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Given that we apply a few restrictions on the full dynticks
CPUs range (keep an online timekeeper oustide the range,
then in the future have the range be an RCU nocb CPUs subset),
let's print the final resulting range of full dynticks CPUs to
the user so that he knows what's really going to run.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Now the user has the choice between three implementations of
the timer tick:
* Static periodic tick
* Idle dynticks
* Full dynticks
At least for now, these are mutually exclusive choices, so
let's rely on the proper Kconfig feature to display these
to the user.
A new entry CONFIG_NO_HZ_IDLE is created and the old
CONFIG_NO_HZ maps to it for config file backward compatibility.
The old name was too general now that we have more
granular dynticks implementations.
While at it, add some explanation to help the user on
his decision between the 3 entries.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
We are planning to convert the dynticks Kconfig options layout
into a choice menu. The user must be able to easily pick
any of the following implementations: constant periodic tick,
idle dynticks, full dynticks.
As this implies a mutual exclusion, the two dynticks implementions
need to converge on the selection of a common Kconfig option in order
to ease the sharing of a common infrastructure.
It would thus seem pretty natural to reuse CONFIG_NO_HZ to
that end. It already implements all the idle dynticks code
and the full dynticks depends on all that code for now.
So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.
On the other hand we want to stay backward compatible: if
CONFIG_NO_HZ is set in an older config file, we want to
enable CONFIG_NO_HZ_IDLE by default.
But we can't afford both at the same time or we run into
a circular dependency:
1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
CONFIG_NO_HZ
2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE
We might be able to support that from Kconfig/Kbuild but it
may not be wise to introduce such a confusing behaviour.
So to solve this, create a new CONFIG_NO_HZ_COMMON option
which gathers the common code between idle and full dynticks
(that common code for now is simply the idle dynticks code)
and select it from their referring Kconfig.
Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
to it for backward compatibility.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The full dynticks feature only shows up when all its
Kconfig dependencies are met (RCU nocbs, RCU user mode, ...)
This is far from being user friendly as those who want to
activate this feature need to look into the Kconfig files
and iterate through each dependency then activate these
by hand in order to show and select the full dynticks
Kconfig option.
So process the other way around: show up the Kconfig option
if the minimal low level dependencies are met and activate
the high level ones when we enable the feature.
Note there is one exception in the picture:
CONFIG_VIRT_CPU_ACCOUNTING_GEN is part of a Kconfig choice
menu and it appears we can't select it from another Kconfig
selection when it's under such layout. So for now this
particular item stays as a passive dependency.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJRWLTrAAoJEHm+PkMAQRiGe8oH/iMy48mecVWvxVZn74Tx3Cef
xmW/PnAIj28EhSPqK49N/Ow6AfQToFKf7AP0ge20KAf5teTq95AY+tH74DAANt8F
BjKXXTZiR5xwBvRkq7CR5wDcCvEcBAAz8fgTEd6SEDB2d2VXFf5eKdKUqt1avTCh
Z6Hup5kuwX+ddtwY2DCBXtp2n6fL0Rm5yLzY1A3OOBye1E7VyLTF7M5BR603Q44P
4kRLxn8+R7jy3hTuZIhAeoS8TKUoBwVk7DmKxEzrhTHZVOmvwE9lEHybRnIyOpd/
k1JnbRbiPsLsCVFOn10SQkGDAIk00lro3tuWP2C1ljERiD/OOh5Ui9nXYAhMkbI=
=q15K
-----END PGP SIGNATURE-----
Merge tag 'v3.9-rc5' into wq/for-3.10
Writeback conversion to workqueue will be based on top of wq/for-3.10
branch to take advantage of custom attrs and NUMA support for unbound
workqueues. Mainline currently contains two commits which result in
non-trivial merge conflicts with wq/for-3.10 and because
block/for-3.10/core is based on v3.9-rc3 which contains one of the
conflicting commits, we need a pre-merge-window merge anyway. Let's
pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
from workqueue merge conflicts.
The two conflicts and their resolutions:
* e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
worker_pool_assign_id() to use idr_alloc() instead of the old idr
interface. worker_pool_assign_id() goes through multiple locking
changes in wq/for-3.10 causing the following conflict.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
<<<<<<< HEAD
lockdep_assert_held(&wq_pool_mutex);
do {
if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
return -ENOMEM;
ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
} while (ret == -EAGAIN);
=======
mutex_lock(&worker_pool_idr_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0)
pool->id = ret;
mutex_unlock(&worker_pool_idr_mutex);
>>>>>>> c67bf5361e
return ret < 0 ? ret : 0;
}
We want locking from the former and idr_alloc() usage from the
latter, which can be combined to the following.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
lockdep_assert_held(&wq_pool_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0) {
pool->id = ret;
return 0;
}
return ret;
}
* eb2834285c ("workqueue: fix possible pool stall bug in
wq_unbind_fn()") updated wq_unbind_fn() such that it has single
larger for_each_std_worker_pool() loop instead of two separate loops
with a schedule() call inbetween. wq/for-3.10 renamed
pool->assoc_mutex to pool->manager_mutex causing the following
conflict (earlier function body and comments omitted for brevity).
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
<<<<<<< HEAD
mutex_unlock(&pool->manager_mutex);
}
=======
mutex_unlock(&pool->assoc_mutex);
>>>>>>> c67bf5361e
schedule();
<<<<<<< HEAD
for_each_cpu_worker_pool(pool, cpu)
=======
>>>>>>> c67bf5361e
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
The resolution is mostly trivial. We want the control flow of the
latter with the rename of the former.
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
mutex_unlock(&pool->manager_mutex);
schedule();
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
Signed-off-by: Tejun Heo <tj@kernel.org>
Unbound workqueues are now NUMA aware. Let's add some control knobs
and update sysfs interface accordingly.
* Add kernel param workqueue.numa_disable which disables NUMA affinity
globally.
* Replace sysfs file "pool_id" with "pool_ids" which contain
node:pool_id pairs. This change is userland-visible but "pool_id"
hasn't seen a release yet, so this is okay.
* Add a new sysf files "numa" which can toggle NUMA affinity on
individual workqueues. This is implemented as attrs->no_numa whichn
is special in that it isn't part of a pool's attributes. It only
affects how apply_workqueue_attrs() picks which pools to use.
After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node is executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this change the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Factor out lock pool, put_pwq(), unlock sequence into
put_pwq_unlocked(). The two existing places are converted and there
will be more with NUMA affinity support.
This is to prepare for NUMA affinity support for unbound workqueues
and doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq()
is made safe to call multiple times. numa_pwq_tbl_install() links the
pwq, installs it into numa_pwq_tbl[] at the specified node and returns
the old entry.
@last_pwq is removed from link_pwq() as the return value of the new
function can be used instead.
This is to prepare for NUMA affinity support for unbound workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool. As there's
no no kmem_cache_zalloc_node(), move zeroing to init_pwq().
This was suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().
This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it. It may have multple pool_workqueues but only the
first pool_workqueue servies new work items. For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.
Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node. This
replaces first_pwq() in __queue_work() and workqueue_congested().
numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.
v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
may be called only with wq->mutex held.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline. These two fields are used in the work item
issue path and thus hot. The scheduled NUMA affinity support will add
dispatch table at the end of workqueue_struct and relocating these two
fields will allow us hitting only single cacheline on hot paths.
Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues. The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently workqueue->name[] is of flexible length. We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name length anyway. Make it fixed len capping
at 24 bytes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.
The planned NUMA affinity support will make unbound workqueues make
use of multiple pool_workqueues for different NUMA nodes and the above
assumption will no longer hold. Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.
Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal. Add pool->node which is
determined by the pool's cpumask. If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.
This currently only makes difference for unbound worker pools with
cpumask contained inside single NUMA node, but this will serve as
foundation for making all unbound pools NUMA-affine.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, all workqueue workers which have negative nice value has
'H' postfixed to their names. This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.
As workers for unbound pools use pool->id, the 'H' postfix is purely
informational. TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs. We're expecting to have more unbound pools with the
scheduled NUMA awareness support. Let's drop the non-essential 'H'
postfix from unbound kworker name.
While at it, restructure kthread_create*() invocation to help future
NUMA related changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation. The former is the
highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
each NUMA node.
This patch only introduces these. Future patches will make use of
them.
v2: NUMA initialization move into wq_numa_init(). Also, the possible
cpumask array is not created if there aren't multiple nodes on the
system. wq_numa_enabled bool added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The scheduled NUMA affinity support for unbound workqueues would need
to walk workqueues list and pool related operations on each workqueue.
Move wq_pool_mutex locking out of get/put_unbound_pool() to their
callers so that pool operations can be performed while walking the
workqueues list, which is also protected by wq_pool_mutex.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
29c91e9912 ("workqueue: implement attribute-based unbound worker_pool
management") implemented attrs based worker_pool matching. It tried
to avoid false negative when comparing cpumasks with custom hash
function; unfortunately, the hash and comparison functions fail to
ignore CPUs which are not possible. It incorrectly assumed that
bitmap_copy() skips leftover bits in the last word of bitmap and
cpumask_equal() ignores impossible CPUs.
This patch updates attrs->cpumask handling such that impossible CPUs
are properly ignored.
* Hash and copy functions no longer do anything special. They expect
their callers to clear impossible CPUs.
* alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
instead of setting all bits and explicit cpumask_setall() for
unbound_std_wq_attrs[] in init_workqueues() is dropped.
* apply_workqueue_attrs() is now responsible for ignoring impossible
CPUs. It makes a copy of @attrs and clears impossible CPUs before
doing anything else.
Signed-off-by: Tejun Heo <tj@kernel.org>
8864b4e59 ("workqueue: implement get/put_pwq()") implemented pwq
(pool_workqueue) refcnting which frees workqueue when the last pwq
goes away. It determined whether it was the last pwq by testing
wq->pwqs is empty. Unfortunately, the test was done outside wq->mutex
and multiple pwq release could race and try to free wq multiple times
leading to oops.
Test wq->pwqs emptiness while holding wq->mutex.
Signed-off-by: Tejun Heo <tj@kernel.org>
Conflicts:
net/mac80211/sta_info.c
net/wireless/core.h
Two minor conflicts in wireless. Overlapping additions of extern
declarations in net/wireless/core.h and a bug fix overlapping with
the addition of a boolean parameter to __ieee80211_key_free().
Signed-off-by: David S. Miller <davem@davemloft.net>
Type of mapping was lost and made it hard for a tool
to distinguish code vs. data mmaps. Perf has the ability
to distinguish the two.
Use a bit in the header->misc bitmask to keep track of
the mmap type. If PERF_RECORD_MISC_MMAP_DATA is set then
the mapping is not executable (!VM_EXEC). If not set, then
the mapping is executable.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-16-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch adds PERF_SAMPLE_DATA_SRC.
PERF_SAMPLE_DATA_SRC collects the data source, i.e., where
did the data associated with the sampled instruction
come from. Information is stored in a perf_mem_data_src
structure. It contains opcode, mem level, tlb, snoop,
lock information, subject to availability in hardware.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
For some events it's useful to weight sample with a hardware
provided number. This expresses how expensive the action the
sample represent was. This allows the profiler to scale
the samples to be more informative to the programmer.
There is already the period which is used similarly, but it
means something different, so I chose to not overload it.
Instead a new sample type for WEIGHT is added.
Can be used for multiple things. Initially it is used for TSX
abort costs and profiling by memory latencies (so to make
expensive load appear higher up in the histograms). The concept
is quite generic and can be extended to many other kinds of
events or architectures, as long as the hardware provides
suitable auxillary values. In principle it could be also used
for software tracepoints.
This adds the generic glue. A new optional sample format for a
64-bit weight value.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This reverts commit 6aa9707099.
Commit 6aa9707099 ("lockdep: check that no locks held at freeze time")
causes problems with NFS root filesystems. The failures were noticed on
OMAP2 and 3 boards during kernel init:
[ BUG: swapper/0/1 still has locks held! ]
3.9.0-rc3-00344-ga937536 #1 Not tainted
-------------------------------------
1 lock held by swapper/0/1:
#0: (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
stack backtrace:
rpc_wait_bit_killable
__wait_on_bit
out_of_line_wait_on_bit
__rpc_execute
rpc_run_task
rpc_call_sync
nfs_proc_get_root
nfs_get_root
nfs_fs_mount_common
nfs_try_mount
nfs_fs_mount
mount_fs
vfs_kern_mount
do_mount
sys_mount
do_mount_root
mount_root
prepare_namespace
kernel_init_freeable
kernel_init
Although the rootfs mounts, the system is unstable. Here's a transcript
from a PM test:
http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt
Here's what the test log should look like:
http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt
Mailing list discussion is here:
http://lkml.org/lkml/2013/3/4/221
Deal with this for v3.9 by reverting the problem commit, until folks can
figure out the right long-term course of action.
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: <maciej.rutecki@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ben Chan <benchan@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9419121330 replaced the macros
NLMSG_NEXT with calls to nlmsg_next which produces this warning:
kernel/audit.c: In function ‘audit_receive_skb’:
kernel/audit.c:928:3: warning: passing argument 2 of ‘nlmsg_next’ makes pointer from integer without a cast
In file included from include/net/rtnetlink.h:5:0,
from include/net/neighbour.h:28,
from include/net/dst.h:17,
from include/net/sock.h:68,
from kernel/audit.c:55:
include/net/netlink.h:359:1: note: expected ‘int *’ but argument is of type ‘int’
Fix this by sending the intended pointer.
Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull userns fixes from Eric W Biederman:
"The bulk of the changes are fixing the worst consequences of the user
namespace design oversight in not considering what happens when one
namespace starts off as a clone of another namespace, as happens with
the mount namespace.
The rest of the changes are just plain bug fixes.
Many thanks to Andy Lutomirski for pointing out many of these issues."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Restrict when proc and sysfs can be mounted
ipc: Restrict mounting the mqueue filesystem
vfs: Carefully propogate mounts across user namespaces
vfs: Add a mount flag to lock read only bind mounts
userns: Don't allow creation if the user is chrooted
yama: Better permission check for ptraceme
pid: Handle the exit of a multi-threaded init.
scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.
Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.
proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.
Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.
In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Guarantee that the policy of which files may be access that is
established by setting the root directory will not be violated
by user namespaces by verifying that the root directory points
to the root of the mount namespace at the time of user namespace
creation.
Changing the root is a privileged operation, and as a matter of policy
it serves to limit unprivileged processes to files below the current
root directory.
For reasons of simplicity and comprehensibility the privilege to
change the root directory is gated solely on the CAP_SYS_CHROOT
capability in the user namespace. Therefore when creating a user
namespace we must ensure that the policy of which files may be access
can not be violated by changing the root directory.
Anyone who runs a processes in a chroot and would like to use user
namespace can setup the same view of filesystems with a mount
namespace instead. With this result that this is not a practical
limitation for using user namespaces.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The current code makes the assumption that a cpu_base lock won't be
held if the CPU corresponding to that cpu_base is offline, which isn't
always true.
If a hrtimer is not queued, then it will not be migrated by
migrate_hrtimers() when a CPU is offlined. Therefore, the hrtimer's
cpu_base may still point to a CPU which has subsequently gone offline
if the timer wasn't enqueued at the time the CPU went down.
Normally this wouldn't be a problem, but a cpu_base's lock is blindly
reinitialized each time a CPU is brought up. If a CPU is brought
online during the period that another thread is performing a hrtimer
operation on a stale hrtimer, then the lock will be reinitialized
under its feet, and a SPIN_BUG() like the following will be observed:
<0>[ 28.082085] BUG: spinlock already unlocked on CPU#0, swapper/0/0
<0>[ 28.087078] lock: 0xc4780b40, value 0x0 .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
<4>[ 42.451150] [<c0014398>] (unwind_backtrace+0x0/0x120) from [<c0269220>] (do_raw_spin_unlock+0x44/0xdc)
<4>[ 42.460430] [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) from [<c071b5bc>] (_raw_spin_unlock+0x8/0x30)
<4>[ 42.469632] [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) from [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8)
<4>[ 42.479521] [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) from [<c00aa014>] (hrtimer_start+0x20/0x28)
<4>[ 42.489247] [<c00aa014>] (hrtimer_start+0x20/0x28) from [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320)
<4>[ 42.498709] [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) from [<c00e6440>] (rcu_idle_enter+0xa0/0xb8)
<4>[ 42.508259] [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) from [<c000f268>] (cpu_idle+0x24/0xf0)
<4>[ 42.516503] [<c000f268>] (cpu_idle+0x24/0xf0) from [<c06ed3c0>] (rest_init+0x88/0xa0)
<4>[ 42.524319] [<c06ed3c0>] (rest_init+0x88/0xa0) from [<c0c00978>] (start_kernel+0x3d0/0x434)
As an example, this particular crash occurred when hrtimer_start() was
executed on CPU #0. The code locked the hrtimer's current cpu_base
corresponding to CPU #1. CPU #0 then tried to switch the hrtimer's
cpu_base to an optimal CPU which was online. In this case, it selected
the cpu_base corresponding to CPU #3.
Before it could proceed, CPU #1 came online and reinitialized the
spinlock corresponding to its cpu_base. Thus now CPU #0 held a lock
which was reinitialized. When CPU #0 finally ended up unlocking the
old cpu_base corresponding to CPU #1 so that it could switch to CPU
#3, we hit this SPIN_BUG() above while in switch_hrtimer_base().
CPU #0 CPU #1
---- ----
... <offline>
hrtimer_start()
lock_hrtimer_base(base #1)
... init_hrtimers_cpu()
switch_hrtimer_base() ...
... raw_spin_lock_init(&cpu_base->lock)
raw_spin_unlock(&cpu_base->lock) ...
<spin_bug>
Solve this by statically initializing the lock.
Signed-off-by: Michael Bohan <mbohan@codeaurora.org>
Link: http://lkml.kernel.org/r/1363745965-23475-1-git-send-email-mbohan@codeaurora.org
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
doc.2013.03.12a: Documentation changes.
fixes.2013.03.13a: Miscellaneous fixes.
idlenocb.2013.03.26b: Remove restrictions on no-CBs CPUs, make
RCU_FAST_NO_HZ take advantage of numbered callbacks, add
callback acceleration based on numbered callbacks.
Now that rcu_start_future_gp() has been abstracted from
rcu_nocb_wait_gp(), rcu_accelerate_cbs() can invoke rcu_start_future_gp()
so as to register the need for any future grace periods needed by a
CPU about to enter dyntick-idle mode. This commit makes this change.
Note that some refactoring of rcu_start_gp() is carried out to avoid
recursion and subsequent self-deadlocks.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CPUs going idle will need to record the need for a future grace
period, but won't actually need to block waiting on it. This commit
therefore splits rcu_start_future_gp(), which does the recording, from
rcu_nocb_wait_gp(), which now invokes rcu_start_future_gp() to do the
recording, after which rcu_nocb_wait_gp() does the waiting.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CPUs going idle need to be able to indicate their need for future grace
periods. A mechanism for doing this already exists for no-callbacks
CPUs, so the idea is to re-use that mechanism. This commit therefore
moves the ->n_nocb_gp_requests field of the rcu_node structure out from
under the CONFIG_RCU_NOCB_CPU #ifdef and renames it to ->need_future_gp.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If CPUs are to give prior notice of needed grace periods, it will be
necessary to invoke rcu_start_gp() without dropping the root rcu_node
structure's ->lock. This commit takes a second step in this direction
by moving the release of this lock to rcu_start_gp()'s callers.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Dyntick-idle CPUs need to be able to pre-announce their need for grace
periods. This can be done using something similar to the mechanism used
by no-CB CPUs to announce their need for grace periods. This commit
moves in this direction by renaming the no-CBs grace-period event tracing
to suit the new future-grace-period needs.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If CPUs are to give prior notice of needed grace periods, it will be
necessary to invoke rcu_start_gp() without dropping the root rcu_node
structure's ->lock. This commit takes a first step in this direction
by moving the release of this lock to the end of rcu_start_gp().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because RCU callbacks are now associated with the number of the grace
period that they must wait for, CPUs can now take advance callbacks
corresponding to grace periods that ended while a given CPU was in
dyntick-idle mode. This eliminates the need to try forcing the RCU
state machine while entering idle, thus reducing the CPU intensiveness
of RCU_FAST_NO_HZ, which should increase its energy efficiency.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that callback acceleration is idempotent, it is safe to accelerate
callbacks during grace-period cleanup on any CPUs that the kthread happens
to be running on. This commit therefore propagates the completion
of the grace period to the per-CPU data structures, and also adds an
rcu_advance_cbs() just before the cpu_needs_another_gp() check in order
to reduce false-positive grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU_FAST_NO_HZ operation is controlled by four compile-time C-preprocessor
macros, but some use cases benefit greatly from runtime adjustment,
particularly when tuning devices. This commit therefore creates the
corresponding sysfs entries.
Reported-by: Robin Randhawa <robin.randhawa@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the per-no-CBs-CPU kthreads are named "rcuo" followed by
the CPU number, for example, "rcuo". This is problematic given that
there are either two or three RCU flavors, each of which gets a per-CPU
kthread with exactly the same name. This commit therefore introduces
a one-letter abbreviation for each RCU flavor, namely 'b' for RCU-bh,
'p' for RCU-preempt, and 's' for RCU-sched. This abbreviation is used
to distinguish the "rcuo" kthreads, for example, for CPU 0 we would have
"rcuob/0", "rcuop/0", and "rcuos/0".
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Currently, the no-CBs kthreads do repeated timed waits for grace periods
to elapse. This is crude and energy inefficient, so this commit allows
no-CBs kthreads to specify exactly which grace period they are waiting
for and also allows them to block for the entire duration until the
desired grace period completes.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the only way to specify no-CBs CPUs is via the rcu_nocbs
kernel command-line parameter. This is inconvenient in some cases,
particularly for randconfig testing, so this commit adds a new set of
kernel configuration parameters. CONFIG_RCU_NOCB_CPU_NONE (the default)
retains the old behavior, CONFIG_RCU_NOCB_CPU_ZERO offloads callback
processing from CPU 0 (along with any other CPUs specified by the
rcu_nocbs boot-time parameter), and CONFIG_RCU_NOCB_CPU_ALL offloads
callback processing from all CPUs.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When a multi-threaded init exits and the initial thread is not the
last thread to exit the initial thread hangs around as a zombie
until the last thread exits. In that case zap_pid_ns_processes
needs to wait until there are only 2 hashed pids in the pid
namespace not one.
v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me)
as suggested by Oleg.
Cc: stable@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Caj Larsson <caj@omnicloud.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull timer fix from Thomas Gleixner:
"A single bugfix which prevents that a non functional timer device is
selected to provide the fallback device, which is supposed to serve
timer interrupts on behalf of non functional devices ..."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevents: Don't allow dummy broadcast timers
Allow BPF_XOR based ALU instructions.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
To simplify locking, the previous patches expanded wq->mutex to
protect all fields of each workqueue instance including the pwqs list
leaving pwq_lock without any user. Remove the unused pwq_lock.
tj: Rebased on top of the current dev branch. Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.
This patch makes wq->saved_max_active protected by wq->mutex instead
of pwq_lock. As pwq_lock locking around pwq_adjust_mac_active() is no
longer necessary, this patch also replaces pwq_lock lockings of
for_each_pwq() around pwq_adjust_max_active() to wq->mutex.
tj: Rebased on top of the current dev branch. Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.
init_and_link_pwq() and pwq_unbound_release_workfn() already grab
wq->mutex when adding or removing a pwq from wq->pwqs list. This
patch makes it official that the list is wq->mutex protected for
writes and updates readers accoridingly. Explicit IRQ toggles for
sched-RCU read-locking in flush_workqueue_prep_pwqs() and
drain_workqueues() are removed as the surrounding wq->mutex can
provide sufficient synchronization.
Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex()
and checks for wq->mutex too.
pwq_lock locking and assertion are not removed by this patch and a
couple of for_each_pwq() iterations are still protected by it.
They'll be removed by future patches.
tj: Rebased on top of the current dev branch. Updated description.
Folded in assert_rcu_or_wq_mutex() renaming from a later patch
along with associated comment updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.
wq->nr_drainers and ->flags are specific to each workqueue. Protect
->nr_drainers and ->flags with wq->mutex instead of pool_mutex.
tj: Rebased on top of the current dev branch. Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently pwq->flush_mutex protects many fields of a workqueue
including, especially, the pwqs list. We're going to expand this
mutex to protect most of a workqueue and eventually replace pwq_lock,
which will make locking simpler and easier to understand.
Drop the "flush_" prefix in preparation.
This patch is pure rename.
tj: Rebased on top of the current dev branch. Updated description.
Use WQ: and WR: instead of Q: and QR: for synchronization labels.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq->flush_mutex will be renamed to wq->mutex and cover all fields
specific to each workqueue and eventually replace pwq_lock, which will
make locking simpler and easier to understand.
Rename wq_mutex to wq_pool_mutex to avoid confusion with wq->mutex.
After the scheduled changes, wq_pool_mutex won't be protecting
anything specific to each workqueue instance anyway.
This patch is pure rename.
tj: s/wqs_mutex/wq_pool_mutex/. Rewrote description.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Yet again, the kbuild test robot saves the day, noting
I left out defining __timekeeping_set_tai_offset as
static. It even sent me this patch.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The "NOHZ: local_softirq_pending" message is a largely informational
message. This makes extra work for customers that have a policy of
investigating all kernel log messages logged at <= KERN_ERR log level.
This patch sets the message to a different log level.
[ tglx: Use pr_warn() ]
Signed-off-by: Rado Vrbovsky <rvrbovsk@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/2037057938.893524.1360345050772.JavaMail.root@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The comments mention HRTIMER_ABS and HRTIMER_REL, these symbols don't
exist, the proper names are HRTIMER_MODE_ABS and HRTIMER_MODE_REL.
Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Jiri Kosina <trivial@kernel.org>
Link: http://lkml.kernel.org/r/1363202438-21234-1-git-send-email-ddaney.cavm@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.
See the wiki at http://bcache.evilpiepirate.org for more.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Hack, but bcache needs a way around lockdep for locking during garbage
collection - we need to keep multiple btree nodes locked for coalescing
and rw_lock_nested() isn't really sufficient or appropriate here.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@redhat.com>
Exported so it can be used by bcache's tracepoints
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
This reverts commit 11b80f459a.
Bcache needs rw semaphores for cache coherency in writeback mode -
writes have to take a read lock on a per cache device rw sem, and
release it when the bio completes.
But since this is for bios it's naturally not in the context of the
process that originally took the lock.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: David Howells <dhowells@redhat.com>
David said:
Commit 6c0c0d4d10 ("poweroff: fix bug in orderly_poweroff()")
apparently fixes one bug in orderly_poweroff(), but introduces
another. The comments on orderly_poweroff() claim it can be called
from any context - and indeed we call it from interrupt context in
arch/powerpc/platforms/pseries/ras.c for example. But since that
commit this is no longer safe, since call_usermodehelper_fns() is not
safe in interrupt context without the UMH_NO_WAIT option.
orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
sleepable. Move the "force" logic into __orderly_poweroff() and change
orderly_poweroff() to use the global poweroff_work which simply calls
__orderly_poweroff().
While at it, remove the unneeded "int argc" and change argv_split() to
use GFP_KERNEL.
We use the global "bool poweroff_force" to pass the argument, this can
obviously affect the previous request if it is pending/running. So we
only allow the "false => true" transition assuming that the pending
"true" should succeed anyway. If schedule_work() fails after that we
know that work->func() was not called yet, it must see the new value.
This means that orderly_poweroff() becomes async even if we do not run
the command and always succeeds, schedule_work() can only fail if the
work is already pending. We can export __orderly_poweroff() and change
the non-atomic callers which want the old semantics.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Feng Hong <hongfeng@marvell.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wake_up_klogd() is useless when CONFIG_PRINTK=n because neither printk()
nor printk_sched() are in use and there are actually no waiter on
log_wait waitqueue. It should be a stub in this case for users like
bust_spinlocks().
Otherwise this results in this warning when CONFIG_PRINTK=n and
CONFIG_IRQ_WORK=n:
kernel/built-in.o In function `wake_up_klogd':
(.text.wake_up_klogd+0xb4): undefined reference to `irq_work_queue'
To fix this, provide an off-case for wake_up_klogd() when
CONFIG_PRINTK=n.
There is much more from console_unlock() and other console related code
in printk.c that should be moved under CONFIG_PRINTK. But for now,
focus on a minimal fix as we passed the merged window already.
[akpm@linux-foundation.org: include printk.h in bust_spinlocks.c]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: James Hogan <james.hogan@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want to shorten the seqcount write hold time. So split the seqlock
into a lock and a seqcount.
Open code the seqwrite_lock in the places which matter and drop the
sequence counter update where it's pointless.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[jstultz: Merge fixups from CLOCK_TAI collisions]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Make the lock a separate entity. Preparatory patch for shadow
timekeeper structure.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Merged with CLOCK_TAI changes]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Nothing outside of the timekeeping core needs that lock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Calculate the cycle interval shifted value once. No functional change,
just makes the code more readable.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
This add a CLOCK_TAI clockid and the needed accessors.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently NTP manages the TAI offset. Since there's plans for a
CLOCK_TAI clockid, push the TAI management into the timekeeping
core.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull perf fixes from Ingo Molnar:
"A fair chunk of the linecount comes from a fix for a tracing bug that
corrupts latency tracing buffers when the overwrite mode is changed on
the fly - the rest is mostly assorted fewliner fixlets."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Add SNB/SNB-EP scheduling constraints for cycle_activity event
kprobes/x86: Check Interrupt Flag modifier when registering probe
kprobes: Make hash_64() as always inlined
perf: Generate EXIT event only once per task context
perf: Reset hwc->last_period on sw clock events
tracing: Prevent buffer overwrite disabled for latency tracers
tracing: Keep overwrite in sync between regular and snapshot buffers
tracing: Protect tracer flags with trace_types_lock
perf tools: Fix LIBNUMA build with glibc 2.12 and older.
tracing: Fix free of probe entry by calling call_rcu_sched()
perf/POWER7: Create a sysfs format entry for Power7 events
perf probe: Fix segfault
libtraceevent: Remove hard coded include to /usr/local/include in Makefile
perf record: Fix -C option
perf tools: check if -DFORTIFY_SOURCE=2 is allowed
perf report: Fix build with NO_NEWT=1
perf annotate: Fix build with NO_NEWT=1
tracing: Fix race in snapshot swapping
Wake up a CPU when a timer list timer is enqueued there and
the target is part of the full dynticks range. Sending an IPI
to it makes it reconsidering the next timer to program on top
of recent updates.
This may later be improved by checking if the tick is really
stopped on the target. This would need some careful
synchronization though. So deal with such optimization later
and start simple.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
This way the full nohz CPUs can safely run with the tick
stopped with a guarantee that somebody else is taking
care of the jiffies and GTOD progression.
Once the duty is attributed to a CPU, it won't change. Also that
CPU can't enter into dyntick idle mode or be hot unplugged.
This may later be improved from a power consumption POV. At
least we should be able to share the duty amongst all CPUs
outside the full dynticks range. Then the duty could even be
shared with full dynticks CPUs when those can't stop their
tick for any reason.
But let's start with that very simple approach first.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fix have_nohz_full_mask offcase]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For extreme usecases such as Real Time or HPC, having
the ability to shutdown the tick when a single task runs
on a CPU is a desired feature:
* Reducing the amount of interrupts improves throughput
for CPU-bound tasks. The CPU is less distracted from its
real job, from an execution time and from the cache point
of views.
* This also improve latency response as we have less critical
sections.
Start with introducing a very simple interface to define
full dynticks CPU: use a boot time option defined cpumask
through the "nohz_extended=" kernel parameter. CPUs that
are part of this range will have their tick shutdown
whenever possible: provided they run a single task and
they don't do kernel activity that require the periodic
tick. These details will be later documented in
Documentation/*
An online CPU must be kept outside this range to handle the
timekeeping.
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
This patch fixes a flaw in perf_output_space(). In case the size
of the space needed is bigger than the actual buffer size, there
may be situations where the function would return true (i.e.,
there is space) when it should not. head > offset due to
rounding of the masking logic.
The problem can be tested by activating BTS on Intel processors.
A BTS record can be as big as 16 pages. The following command
fails:
$ perf record -m 4 -c 1 -e branches:u my_test_program
You will get a buffer corruption with this. Perf report won't be
able to parse the perf.data.
The fix is to first check that the requested space is smaller
than the buffer size. If so, then the masking logic will work
fine. If not, then there is no chance the record can be saved
and it will be gracefully handled by upper code layers.
[ In v2, we also make the logic for the writable more explicit by
renaming it to rb->overwrite because it tells whether or not the
buffer can overwrite its tail (suggested by PeterZ). ]
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked. There's no reason to trigger BUG while holding rq
lock crashing the whole system.
Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge in all pending fixes, before pulling the latest development
bits from Arnaldo - which will involve merge conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the README file in debugfs/tracing to something more useful.
What's currently in the file is very old and what it shows doesn't
have much use. Heck, it tells you how to mount debugfs! But to read
this file you would have already needed to mount it.
Replace the file with current up-to-date information. It's rather
limited, but what do you expect from a pseudo README file.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If lockdep complains something for other subsystem, lockdep_is_held()
can be false negative, so we need to also test debug_locks before
triggering WARN.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
rcu_read_lock_sched() is better than preempt_disable() if the code is
protected by RCU_SCHED.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If pwq_adjust_max_active() changes max_active from 0 to
saved_max_active, it needs to wakeup worker. This is already done by
thaw_workqueues().
If pwq_adjust_max_active() increases max_active for an unbound wq,
while not strictly necessary for correctness, it's still desirable to
wake up a worker so that the requested concurrency level is reached
sooner.
Move wake_up_worker() call from thaw_workqueues() to
pwq_adjust_max_active() so that it can handle both of the above two
cases. This also makes thaw_workqueues() simpler.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We can test worker->recue_wq instead of reaching into
current_pwq->wq->rescuer and then comparing it to self.
tj: Commit message.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Process connector can now also detect coredumping events.
Main aim of patch is get notified at start of coredumping, instead of
having to wait for it to finish and then being notified through EXIT
event.
Could be used for instance by process-managers that want to get
notified as soon as possible about process failures, and not
necessarily beeing notified after coredump, which could be in the
order of minutes depending on size of coredump, piping and so on.
Signed-off-by: Jesper Derehag <jderehag@hotmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
get_unbound_pool() forgot to set POOL_FREEZING if workqueue_freezing
is set and a new pool could go out of sync with the global freezing
state.
Fix it by adding POOL_FREEZING if workqueue_freezing. wq_mutex is
already held so no further locking is necessary. This also removes
the unused static variable warning when !CONFIG_FREEZER.
tj: Updated commit message.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The 3rd parameter of flex_array_prealloc() is the number of elements,
not the index of the last element.
The effect of the bug is, when opening cgroup.procs, a flex array will
be allocated and all elements of the array is allocated with
GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to
allocate memory for it, it'll trigger a BUG_ON().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fix symbol versioning on architectures with symbol prefixes. Although
the build was free from warnings the actual modules still wouldn't load
as the ____versions table contained unprefixed symbol names, which were
being compared against the prefixed symbol names when checking the
symbol versions.
This is fixed by modifying modpost to add the symbol prefix to the
____versions table it outputs (Modules.symvers still contains unprefixed
symbol names). The check_modstruct_version() function is also fixed as
it checks the version of the unprefixed "module_layout" symbol which
would no longer work.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Kliegman <kliegs@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use VMLINUX_SYMBOL_STR)
With the recent addition of the custom attributes support, unbound
pools may have allowed cpumask which isn't full. As long as some of
CPUs in the cpumask are online, its workers will maintain cpus_allowed
as set on worker creation; however, once no online CPU is left in
cpus_allowed, the scheduler will reset cpus_allowed of any workers
which get scheduled so that they can execute.
To remain compliant to the user-specified configuration, CPU affinity
needs to be restored when a CPU becomes online for an unbound pool
which doesn't currently have any online CPUs before.
This patch implement restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
coming up CPU is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been relaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety is stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
rebind_workers() will be reimplemented in a way which makes it mostly
decoupled from the rest of worker management. Move rebind_workers()
so that it's located with other CPU hotplug related functions.
This patch is pure function relocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Make worker_ida an idr - worker_idr and use it to implement
for_each_pool_worker() which will be used to simplify worker rebinding
on CPU_ONLINE.
pool->worker_idr is protected by both pool->manager_mutex and
pool->lock so that it can be iterated while holding either lock.
* create_worker() allocates ID without installing worker pointer and
installs the pointer later using idr_replace(). This is because
worker ID is needed when creating the actual task to name it and the
new worker shouldn't be visible to iterations before fully
initialized.
* In destroy_worker(), ID removal is moved before kthread_stop().
This is again to guarantee that only fully working workers are
visible to for_each_pool_worker().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself. Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.
What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks. In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.
This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.
This will allow simplifying workqueue worker CPU affinity management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJRRkrbAAoJEHm+PkMAQRiGy3oH/jrbHinYs0auurANgx4TdtWT
/WNajstKBqLOJJ6cnTR7sOqwOVlptt65EbbTs+qGyZ2Z2W/Lg0BMenHvNHo4ER8C
e7UbMdBCSLKBjAMKh1XCoZscGv4Exm8WRH3Vc5yP0Hafj3EzSAVLY1dta9WKKoQi
bh7D1ErUlbU1zczA1w5YbPF0LqFKRvyZOwebMCCAKAxv5wWAxmbcPNxVR4sufkjg
k6TkQ2ysgWivZAfy3tJYOcxiEu7ahpZVEuYdlZEJQXHRQUfoNljQlOp4BqKsYUai
5A0kaf2VpKay/7pkhvTfBBcF/jFJ68pYP6gQ2ThNdr0b5kOiAfMWj030Xyngnhg=
=iO9t
-----END PGP SIGNATURE-----
Merge tag 'v3.9-rc3' into drm-intel-next-queued
Backmerge so that I can merge Imre Deak's coalesced sg entries fixes,
which depend upon the new for_each_sg_page introduce in
commit a321e91b6d
Author: Imre Deak <imre.deak@intel.com>
Date: Wed Feb 27 17:02:56 2013 -0800
lib/scatterlist: add simple page iterator
The merge itself is just two trivial conflicts:
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Pull workqueue fix from Tejun Heo:
"Lai's patch to fix highly unlikely but still possible workqueue stall
during CPU hotunplug."
* 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix possible pool stall bug in wq_unbind_fn()
On allocation failure, it would fail to free the old attrs array which
was no longer referenced by anything (since it would free the old
module_param_attrs struct on the way out).
Comment the suspicious-looking krealloc() usage to explain why it *isn't*
actually buggy, despite looking like a classic realloc() usage bug.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
It's a per-cpu data structure but missed the __percpu annotation.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Link: http://lkml.kernel.org/r/1363600594-11453-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Thomas noted that we do the wakeup preemption check after the
wakeup trace point, this means the tracepoint cannot test/report
this decision; which is rather important for latency sensitive
workloads. Therefore move the tracepoint after doing the
preemption check.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Paul Turner <pjt@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1363254519.26965.9.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull CPU runtime stats/accounting fixes from Frederic Weisbecker:
" Some users are complaining that their threadgroup's runtime accounting
freezes after a week or so of intense cpu-bound workload. This set tries
to fix the issue by reducing the risk of multiplication overflow in the
cputime scaling code. "
Stanislaw Gruszka further explained the historic context and impact of the
bug:
" Commit 0cf55e1ec0 start to use scalling
for whole thread group, so increase chances of hitting multiplication
overflow, depending on how many CPUs are on the system.
We have multiplication utime * rtime for one thread since commit
b27f03d4bd.
Overflow will happen after:
rtime * utime > 0xffffffffffffffff jiffies
if thread utilize 100% of CPU time, that gives:
rtime > sqrt(0xffffffffffffffff) jiffies
ritme > sqrt(0xffffffffffffffff) / (24 * 60 * 60 * HZ) days
For HZ 100 it will be 497 days for HZ 1000 it will be 49 days.
Bug affect only users, who run CPU intensive application for that
long period. Also they have to be interested on utime,stime values,
as bug has no other visible effect as making those values incorrect. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_task_event() iterates pmu list and generate events
for each eligible pmu context. But if task_event has task_ctx
like in EXIT it'll generate events even though the pmu doesn't
have an eligible one. Fix it by moving the code to proper
places.
Before this patch:
$ perf record -n true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB perf.data (~248 samples) ]
$ perf report -D | tail
Aggregated stats:
TOTAL events: 73
MMAP events: 67
COMM events: 2
EXIT events: 4
cycles stats:
TOTAL events: 73
MMAP events: 67
COMM events: 2
EXIT events: 4
After this patch:
$ perf report -D | tail
Aggregated stats:
TOTAL events: 70
MMAP events: 67
COMM events: 2
EXIT events: 1
cycles stats:
TOTAL events: 70
MMAP events: 67
COMM events: 2
EXIT events: 1
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363332433-7637-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When cpu/task clock events are initialized, their sampling
frequencies are converted to have a fixed value. However it
missed to update the hwc->last_period which was set to 1 for
initial sampling frequency calibration.
Because this hwc->last_period value is used as a period in
perf_swevent_ hrtime(), every recorded sample will have an
incorrected period of 1.
$ perf record -e task-clock noploop 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.158 MB perf.data (~6919 samples) ]
$ perf report -n --show-total-period --stdio
# Samples: 4K of event 'task-clock'
# Event count (approx.): 4000
#
# Overhead Samples Period Command Shared Object Symbol
# ........ ............ ............ ....... ............. ..................
#
99.95% 3998 3998 noploop noploop [.] main
0.03% 1 1 noploop libc-2.15.so [.] init_cacheinfo
0.03% 1 1 noploop ld-2.15.so [.] open_verify
Note that it doesn't affect the non-sampling event so that the
perf stat still gets correct value with or without this patch.
$ perf stat -e task-clock noploop 1
Performance counter stats for 'noploop 1':
1000.272525 task-clock # 1.000 CPUs utilized
1.000560605 seconds time elapsed
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363574507-18808-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are some new processors whose TSC clocksource won't stop during
suspend. Currently, after system resumes, kernel will use persistent
clock or RTC to compensate the sleep time, but with these nonstop
clocksources, we could skip the special compensation from external
sources, and just use current clocksource for time recounting.
This can solve some time drift bugs caused by some not-so-accurate or
error-prone RTC devices.
The current way to count suspended time is first try to use the persistent
clock, and then try the RTC if persistent clock can't be used. This
patch will change the trying order to:
suspend-nonstop clocksource -> persistent clock -> RTC
When counting the sleep time with nonstop clocksource, use an accurate way
suggested by Jason Gunthorpe to cover very large delta cycles.
Signed-off-by: Feng Tang <feng.tang@intel.com>
[jstultz: Small optimization, avoiding re-reading the clocksource]
Signed-off-by: John Stultz <john.stultz@linaro.org>
When warping the clock (from a local time RTC), use
timekeeping_inject_offset() to atomically add the offset.
This avoids any minor time error caused by the delay between
reading the time, and then setting the adjusted time.
Signed-off-by: John Stultz <john.stultz@linaro.org>
If the Hardware Clock kept in local time,kernel will adjust the time
to be UTC time.But if Hardware Clock kept in UTC time,system will make
a dummy settimeofday call first (sys_tz.tz_minuteswest = 0) to make sure
the time is not shifted,so at this point I think maybe it is not necessary
to set the kernel time once the sys_tz.tz_minuteswest is zero.
Signed-off-by: Dong Zhu <bluezhudong@gmail.com>
[jstultz: Updated to merge with conflicting changes ]
Signed-off-by: John Stultz <john.stultz@linaro.org>
ftrace_dump() had a lot of issues. What ftrace_dump() does, is when
ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
will dump out the ftrace buffers to the console when either a oops,
panic, or a sysrq-z occurs.
This was written a long time ago when ftrace was fragile to recursion.
But it wasn't written well even for that.
There's a possible deadlock that can occur if a ftrace_dump() is happening
and an NMI triggers another dump. This is because it grabs a lock
before checking if the dump ran.
It also totally disables ftrace, and tracing for no good reasons.
As the ring_buffer now checks if it is read via a oops or NMI, where
there's a chance that the buffer gets corrupted, it will disable
itself. No need to have ftrace_dump() do the same.
ftrace_dump() is now cleaned up where it uses an atomic counter to
make sure only one dump happens at a time. A simple atomic_inc_return()
is enough that is needed for both other CPUs and NMIs. No need for
a spinlock, as if one CPU is running the dump, no other CPU needs
to do it too.
The tracing_on variable is turned off and not turned on. The original
code did this, but it wasn't pretty. By just disabling this variable
we get the result of not seeing traces that happen between crashes.
For sysrq-z, it doesn't get turned on, but the user can always write
a '1' to the tracing_on file. If they are using sysrq-z, then they should
know about tracing_on.
The new code is much easier to read and less error prone. No more
deadlock possibility when an NMI triggers here.
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_event_mutex is an rw semaphore now, not a mutex, change the name.
Link: http://lkml.kernel.org/r/513D843B.40109@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ Forward ported to my new code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ppc64 has its own syscall prefix like ".SyS" or ".sys". Make the
comment in arch_syscall_match_sym_name() more understandable.
Link: http://lkml.kernel.org/r/513D842F.40205@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_destroy_fields() is not used outside of the file. It can be
a static function.
Link: http://lkml.kernel.org/r/513D842A.2000907@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By moving find_event_field() and trace_find_field() into trace_events.c,
the ftrace_common_fields list and trace_get_fields() can become local to
the trace_events.c file.
find_event_field() is renamed to trace_find_event_field() to conform to
the tracing global function names.
Link: http://lkml.kernel.org/r/513D8426.9070109@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ rostedt: Modified trace_find_field() to trace_find_event_field() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TRACE_MAX_PRINT macro is defined, but is not used.
Link: http://lkml.kernel.org/r/513D8421.4070404@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use pr_warn_once, instead of making an open coded implementation.
Link: http://lkml.kernel.org/r/513D8419.20400@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When testing my large changes to the ftrace system, there was
a bug that looked like the ring buffer was dropping events.
I wrote up a quick integrity checker of the ring buffer to
see if it was.
Although the bug ended up being something stupid I did in ftrace,
and had nothing to do with the ring buffer, I figured if I spent
the time to write up this test, I might as well include it in the
kernel.
I cleaned it up a bit, as the original version was rather ugly.
Not saying this version is pretty, but it's a beauty queen
compared to what I original wrote.
To enable the start up test, set CONFIG_RING_BUFFER_STARTUP_TEST.
Note, it runs for 10 seconds, so it will slow your boot time
by at least 10 more seconds.
What it does is documented in both the comments and the Kconfig
help.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We have CONFIG_SYMBOL_PREFIX, which three archs define to the string
"_". But Al Viro broke this in "consolidate cond_syscall and
SYSCALL_ALIAS declarations" (in linux-next), and he's not the first to
do so.
Using CONFIG_SYMBOL_PREFIX is awkward, since we usually just want to
prefix it so something. So various places define helpers which are
defined to nothing if CONFIG_SYMBOL_PREFIX isn't set:
1) include/asm-generic/unistd.h defines __SYMBOL_PREFIX.
2) include/asm-generic/vmlinux.lds.h defines VMLINUX_SYMBOL(sym)
3) include/linux/export.h defines MODULE_SYMBOL_PREFIX.
4) include/linux/kernel.h defines SYMBOL_PREFIX (which differs from #7)
5) kernel/modsign_certificate.S defines ASM_SYMBOL(sym)
6) scripts/modpost.c defines MODULE_SYMBOL_PREFIX
7) scripts/Makefile.lib defines SYMBOL_PREFIX on the commandline if
CONFIG_SYMBOL_PREFIX is set, so that we have a non-string version
for pasting.
(arch/h8300/include/asm/linkage.h defines SYMBOL_NAME(), too).
Let's solve this properly:
1) No more generic prefix, just CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX.
2) Make linux/export.h usable from asm.
3) Define VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR().
4) Make everyone use them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Tested-by: James Hogan <james.hogan@imgtec.com> (metag)
The function trace_clock() calls "local_clock()" which is exactly
the same clock that perf uses. I'm not sure why perf doesn't call
trace_clock(), as trace_clock() doesn't have any users.
But now it does. As trace_clock() calls local_clock() like perf does,
I added the trace_clock "perf" option that uses trace_clock().
Now the ftrace buffers can use the same clock as perf uses. This
will be useful when perf starts reading the ftrace buffers, and will
be able to interleave them with the same clock data.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a simple trace clock called "uptime" for those that are
interested in the uptime of the trace. It uses jiffies as that's
the safest method, as other uptime clocks grab seq locks, which could
cause a deadlock if taken from an event or function tracer.
Requested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the only way to stop the latency tracers from doing function
tracing is to fully disable the function tracer from the proc file
system:
echo 0 > /proc/sys/kernel/ftrace_enabled
This is a big hammer approach as it disables function tracing for
all users. This includes kprobes, perf, stack tracer, etc.
Instead, create a function-trace option that the latency tracers can
check to determine if it should enable function tracing or not.
This option can be set or cleared even while the tracer is active
and the tracers will disable or enable function tracing depending
on how the option was set.
Instead of using the proc file, disable latency function tracing with
echo 0 > /debug/tracing/options/function-trace
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the depth reported in the stack tracer stack_trace file
does not match the stack_max_size file. This is because the stack_max_size
includes the overhead of stack tracer itself while the depth does not.
The first time a max is triggered, a calculation is not performed that
figures out the overhead of the stack tracer and subtracts it from
the stack_max_size variable. The overhead is stored and is subtracted
from the reported stack size for comparing for a new max.
Now the stack_max_size corresponds to the reported depth:
# cat stack_max_size
4640
# cat stack_trace
Depth Size Location (48 entries)
----- ---- --------
0) 4640 32 _raw_spin_lock+0x18/0x24
1) 4608 112 ____cache_alloc+0xb7/0x22d
2) 4496 80 kmem_cache_alloc+0x63/0x12f
3) 4416 16 mempool_alloc_slab+0x15/0x17
[...]
While testing against and older gcc on x86 that uses mcount instead
of fentry, I found that pasing in ip + MCOUNT_INSN_SIZE let the
stack trace show one more function deep which was missing before.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When gcc 4.6 on x86 is used, the function tracer will use the new
option -mfentry which does a call to "fentry" at every function
instead of "mcount". The significance of this is that fentry is
called as the first operation of the function instead of the mcount
usage of being called after the stack.
This causes the stack tracer to show some bogus results for the size
of the last function traced, as well as showing "ftrace_call" instead
of the function. This is due to the stack frame not being set up
by the function that is about to be traced.
# cat stack_trace
Depth Size Location (48 entries)
----- ---- --------
0) 4824 216 ftrace_call+0x5/0x2f
1) 4608 112 ____cache_alloc+0xb7/0x22d
2) 4496 80 kmem_cache_alloc+0x63/0x12f
The 216 size for ftrace_call includes both the ftrace_call stack
(which includes the saving of registers it does), as well as the
stack size of the parent.
To fix this, if CC_USING_FENTRY is defined, then the stack_tracer
will reserve the first item in stack_dump_trace[] array when
calling save_stack_trace(), and it will fill it in with the parent ip.
Then the code will look for the parent pointer on the stack and
give the real size of the parent's stack pointer:
# cat stack_trace
Depth Size Location (14 entries)
----- ---- --------
0) 2640 48 update_group_power+0x26/0x187
1) 2592 224 update_sd_lb_stats+0x2a5/0x4ac
2) 2368 160 find_busiest_group+0x31/0x1f1
3) 2208 256 load_balance+0xd9/0x662
I'm Cc'ing stable, although it's not urgent, as it only shows bogus
size for item #0, the rest of the trace is legit. It should still be
corrected in previous stable releases.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use the stack of stack_trace_call() instead of check_stack() as
the test pointer for max stack size. It makes it a bit cleaner
and a little more accurate.
Adding stable, as a later fix depends on this patch.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Altough the trace_dump_stack() already skips three functions in
the call to stack trace, which gets the stack trace to start
at the caller of the function, the caller may want to skip some
more too (as it may have helper functions).
Add a skip argument to the trace_dump_stack() that lets the caller
skip back tracing functions that it doesn't care about.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add triggers to function tracer that lets an event get enabled or
disabled when a function is called:
format is:
<function>:enable_event:<system>:<event>[:<count>]
<function>:disable_event:<system>:<event>[:<count>]
echo 'schedule:enable_event:sched:sched_switch' > /debug/tracing/set_ftrace_filter
Every time schedule is called, it will enable the sched_switch event.
echo 'schedule:disable_event:sched:sched_switch:2' > /debug/tracing/set_ftrace_filter
The first two times schedule is called while the sched_switch
event is enabled, it will disable it. It will not count for a time
that the event is already disabled (or enabled for enable_event).
[ fixed return without mutex_unlock() - thanks to Dan Carpenter and smatch ]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to let triggers enable or disable events, we need a 'soft'
method for doing so. For example, if a function probe is added that
lets a user enable or disable events when a function is called, that
change must be done without taking locks or a mutex, and definitely
it can't sleep. But the full enabling of a tracepoint is expensive.
By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
without the protection of a mutex (using set/clear_bit()), this soft
disable flag can be used to allow critical sections to enable or disable
events from being traced (after the event has been placed into "SOFT_MODE").
Some caveats though: The comm recorder (to map pids with a comm) can not
be soft disabled (yet). If you disable an event with with a "soft"
disable and wait a while before reading the trace, the comm cache may be
replaced and you'll get a bunch of <...> for comms in the trace.
Reading the "enable" file for an event that is disabled will now give
you "0*" where the '*' denotes that the tracepoint is still active but
the event itself is "disabled".
[ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The entries to the probe hash must be freed after a synchronize_sched()
after the entry has been removed from the hash.
As the entries are registered with ops that may have their own callbacks,
and these callbacks may sleep, we can not use call_rcu_sched() because
the rcu callbacks registered with that are called from a softirq context.
Instead of using call_rcu_sched(), manually save the entries on a free_list
and at the end of the loop that removes the entries, do a synchronize_sched()
and then go through the free_list, freeing the entries.
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a function probe is created, each function that the probe is
attached to, a "callback" method is called. On release of the probe,
each function entry calls the "free" method.
First, "callback" is a confusing name and does not really match what
it does. Callback sounds like it will be called when the probe
triggers. But that's not the case. This is really an "init" function,
so lets rename it as such.
Secondly, both "init" and "free" do not pass enough information back
to the handlers. Pass back the ops, ip and data for each time the
method is called. We have the information, might as well use it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
echo 'schedule:snapshot:1' > /debug/tracing/set_ftrace_filter
This will cause the scheduler to trigger a snapshot the next time
it's called (you can use any function that's not called by NMI).
Even though it triggers only once, you still need to remove it with:
echo '!schedule:snapshot:0' > /debug/tracing/set_ftrace_filter
The :1 can be left off for the first command:
echo 'schedule:snapshot' > /debug/tracing/set_ftrace_filter
But this will cause all calls to schedule to trigger a snapshot.
This must be removed without the ':0'
echo '!schedule:snapshot' > /debug/tracing/set_ftrace_filter
As adding a "count" is a different operation (internally).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add alloc_snapshot() and free_snapshot() to allocate and free the
snapshot buffer respectively, and use these to remove duplicate
code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function probe enables all functions and runs a "hash"
against every function call to see if it should call a probe. This
is extremely wasteful.
Note, a probe is something like:
echo schedule:traceoff > /debug/tracing/set_ftrace_filter
When schedule is called, the probe will disable tracing. But currently,
it has a call back for *all* functions, and checks to see if the
called function is the probe that is needed.
The probe function has been created before ftrace was rewritten to
allow for more than one "op" to be registered by the function tracer.
When probes were created, it couldn't limit the functions without also
limiting normal function calls. But now we can, it's about time
to update the probe code.
Todo, have separate ops for different entries. That is, assign
a ftrace_ops per probe, instead of one op for all probes. But
as there's not many probes assigned, this may not be that urgent.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing probes that trigger traceon or traceoff can be
set to unlimited, or given a count of # of times to execute.
By separating these two types of probes, we can then use the dynamic
ftrace function filtering directly, and remove the brute force
"check if this function called is my probe" routines in ftrace.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only thing ftrace_trace_onoff_unreg() does is to do a strcmp()
against the cmd parameter to determine what op to unregister. But
this compare is also done after the location that this function is
called (and returns). By moving the check for '!' to unregister after
the strcmp(), the callback function itself can just do the unregister
and we can get rid of the helper function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove some duplicate code and replace it with a helper function.
This makes the code a it cleaner.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add EXPORT_SYMBOL_GPL() to let the tracing_snapshot() functions be
called from modules.
Also add a test to see if the snapshot was called from NMI context
and just warn in the tracing buffer if so, and return.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a few places that ftrace uses trace_printk() for internal
use, but this requires context (normal, softirq, irq, NMI) buffers
to keep things lockless. But the trace_puts() does not, as it can
write the string directly into the ring buffer. Make a internal helper
for trace_puts() and have the internal functions use that.
This way the extra context buffers are not used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() is extremely fast and is very handy as it can be
used in any context (including NMIs!). But it still requires scanning
the fmt string for parsing the args. Even the trace_bprintk() requires
a scan to know what args will be saved, although it doesn't copy the
format string itself.
Several times trace_printk() has no args, and wastes cpu cycles scanning
the fmt string.
Adding trace_puts() allows the developer to use an even faster
tracing method that only saves the pointer to the string in the
ring buffer without doing any format parsing at all. This will
help remove even more of the "Heisenbug" effect, when debugging.
Also fixed up the F_printk()s for the ftrace internal bprint and print events.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The changce to add the trace_buffer struct to have the trace array
have both the main buffer and max buffer broke the branch tracer
because the change did not update that code. As the branch tracer
adds a significant amount of overhead, and must be selected via
a selection (not a allyesconfig) it was missed in testing.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If debugging the kernel, and the developer wants to use
tracing_snapshot() in places where tracing_snapshot_alloc() may
be difficult (or more likely, the developer is lazy and doesn't
want to bother with tracing_snapshot_alloc() at all), then adding
alloc_snapshot
to the kernel command line parameter will tell ftrace to allocate
the snapshot buffer (if configured) when it allocates the main
tracing buffer.
I also noticed that ring_buffer_expanded and tracing_selftest_disabled
had inconsistent use of boolean "true" and "false" with "0" and "1".
I cleaned that up too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the tracing startup selftest code into its own function and
when not enabled, always have that function succeed.
This makes the register_tracer() function much more readable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring buffer updates when done while the ring buffer is active,
needs to be completed on the CPU that is used for the ring buffer
per_cpu buffer. To accomplish this, schedule_work_on() is used to
schedule work on the given CPU.
Now there's no reason to use schedule_work_on() if the process
doing the update happens to be on the CPU that it is processing.
It has already filled the requirement. Instead, just do the work
and continue.
This is needed for tracing_snapshot_alloc() where it may be called
really early in boot, where the work queues have not been set up yet.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new snapshot feature is quite handy. It's a way for the user
to take advantage of the spare buffer that, until then, only
the latency tracers used to "snapshot" the buffer when it hit
a max latency. Now users can trigger a "snapshot" manually when
some condition is hit in a program. But a snapshot currently can
not be triggered by a condition inside the kernel.
With the addition of tracing_snapshot() and tracing_snapshot_alloc(),
snapshots can now be taking when a condition is hit, and the
developer wants to snapshot the case without stopping the trace.
Note, any snapshot will overwrite the old one, so take care
in how this is done.
These new functions are to be used like tracing_on(), tracing_off()
and trace_printk() are. That is, they should never be called
in the mainline Linux kernel. They are solely for the purpose
of debugging.
The tracing_snapshot() will not allocate a buffer, but it is
safe to be called from any context (except NMIs). But if a
snapshot buffer isn't allocated when it is called, it will write
to the live buffer, complaining about the lack of a snapshot
buffer, and then stop tracing (giving you the "permanent snapshot").
tracing_snapshot_alloc() will allocate the snapshot buffer if
it was not already allocated and then take the snapshot. This routine
*may sleep*, and must be called from context that can sleep.
The allocation is done with GFP_KERNEL and not atomic.
If you need a snapshot in an atomic context, say in early boot,
then it is best to call the tracing_snapshot_alloc() before then,
where it will allocate the buffer, and then you can use the
tracing_snapshot() anywhere you want and still get snapshots.
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a ref count to the trace_array structure and prevent removal
of instances that have open descriptors.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the per_cpu directory to the created tracing instances:
cd /sys/kernel/debug/tracing/instances
mkdir foo
ls foo/per_cpu/cpu0
buffer_size_kb snapshot_raw trace trace_pipe_raw
snapshot stats trace_pipe
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "snapshot" file to the the multi-buffer instances.
cd /sys/kernel/debug/tracing/instances
mkdir foo
ls foo
buffer_size_kb buffer_total_size_kb events free_buffer set_event
snapshot trace trace_clock trace_marker trace_options trace_pipe
tracing_on
cat foo/snapshot
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a bit of duplicate code in creating the trace buffers for
the normal trace buffer and the max trace buffer among the instances
and the main global_trace. This code can be consolidated and cleaned
up a bit making the code cleaner and more readable as well as less
duplication.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The snapshot buffer belongs to the trace array not the tracer that is
running. The trace array should be the data structure that keeps track
of whether or not the snapshot buffer is allocated, not the tracer
desciptor. Having the trace array keep track of it makes modifications
so much easier.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a 'snapshot_raw' per_cpu file that allows tools to read the raw
binary data of the snapshot buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the preempt or irq latency tracers are enabled, they require
the ring buffer to be able to swap the per cpu sub buffers between
two main buffers. This adds a slight overhead to tracing as the
trace recording needs to perform some checks to synchronize
between recording and swaps that might be happening on other CPUs.
The config RING_BUFFER_ALLOW_SWAP is set when a user of the ring
buffer needs the "swap cpu" feature, otherwise the extra checks
are not implemented and removed from the tracing overhead.
The snapshot feature will swap per CPU if the RING_BUFFER_ALLOW_SWAP
config is set. But that only gets set by things like OPROFILE
and the irqs and preempt latency tracers.
This config is added to let the user decide to include this feature
with the snapshot agnostic from whether or not another user of
the ring buffer sets this config.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the snapshot file into the per_cpu tracing directories to allow
them to be read for an individual cpu. This also allows to clear
an individual cpu from the snapshot buffer.
If the kernel allows it (CONFIG_RING_BUFFER_ALLOW_SWAP is set), then
echoing in '1' into one of the per_cpu snapshot files will do an
individual cpu buffer swap instead of the entire file.
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the way the latency tracers and snapshot feature works
is to have a separate trace_array called "max_tr" that holds the
snapshot buffer. For latency tracers, this snapshot buffer is used
to swap the running buffer with this buffer to save the current max
latency.
The only items needed for the max_tr is really just a copy of the buffer
itself, the per_cpu data pointers, the time_start timestamp that states
when the max latency was triggered, and the cpu that the max latency
was triggered on. All other fields in trace_array are unused by the
max_tr, making the max_tr mostly bloat.
This change removes the max_tr completely, and adds a new structure
called trace_buffer, that holds the buffer pointer, the per_cpu data
pointers, the time_start timestamp, and the cpu where the latency occurred.
The trace_array, now has two trace_buffers, one for the normal trace and
one for the max trace or snapshot. By doing this, not only do we remove
the bloat from the max_trace but the instances of traces can now use
their own snapshot feature and not have just the top level global_trace have
the snapshot feature and latency tracers for itself.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The snapshot utility is extremely useful, and does not add any more
overhead in memory when another latency tracer is enabled. They use
the snapshot underneath. There's no reason to hide the snapshot file
when a latency tracer has been enabled in the kernel.
If any of the latency tracers (irq, preempt or wakeup) is enabled
then also select the snapshot facility.
Note, snapshot can be enabled without the latency tracers enabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently we do not know what buffer a module event was enabled in.
On unload, it is safest to clear all buffer instances, not just the
top level buffer.
Todo: Clear only the buffer that the event was used in. The
infrastructure is there to do this, but it makes the code a bit
more complex. Lets get the current code vetted before we add that.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, when a module with events is unloaded, the trace buffer is
cleared. This is just a safety net in case the module might have some
strange callback when its event is outputted. But there's no reason
to reset the buffer if the module didn't have any of its events traced.
Add a flag to the event "call" structure called WAS_ENABLED and gets set
when the event is ever enabled, and this flag never gets cleared. When a
module gets unloaded, if any of its events have this flag set, then the
trace buffer will get cleared.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The move of blocked readers to the ring buffer left out the
init of the wait queue that is used. Tests missed this due to running
stress tests against the buffers, which didn't allow for any
readers to end up waiting. Running a simple read and wait triggered
a bug.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As we've added __init annotation to field-defining functions, we should
add __refdata annotation to event_call variables, which reference those
functions.
Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new multi-buffers added a descriptor that kept track of module
events, and the directories they use, with struct ftace_module_file_ops.
This is used to add a ref count to keep modules from unloading while
their files are being accessed.
As the descriptor is only needed when CONFIG_MODULES is enabled, it
is only declared when the config is enabled. But that struct is
dereferenced in a few areas outside the #ifdef CONFIG_MODULES.
By adding some helper routines and moving code around a little,
events can be compiled again without modules.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the conversion of the data array to per cpu, sparse now complains
about the use of per_cpu_ptr() on the variable. But The variable is
allocated with alloc_percpu() and is fine to use. But since the structure
that contains the data variable does not annotate it as such, sparse
gives out a lot of false warnings.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These two functions are called during kernel boot only.
Link: http://lkml.kernel.org/r/51258796.7020704@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move duplicate code in event print functions to a helper function.
This shrinks the size of the kernel by ~13K.
text data bss dec hex filename
6596137 1743966 10138672 18478775 119f6b7 vmlinux.o.old
6583002 1743849 10138672 18465523 119c2f3 vmlinux.o.new
Link: http://lkml.kernel.org/r/51258746.2060304@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the logic to wake up on ring buffer data into the ring buffer
code itself. This simplifies the tracing code a lot and also has the
added benefit that waiters on one of the instance buffers can be woken
only when data is added to that instance instead of data added to
any instance.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ring buffer is empty, a read to trace_pipe_raw wont block.
The tracing code has the infrastructure to wake up waiting readers,
but the trace_pipe_raw doesn't take advantage of that.
When a read is done to trace_pipe_raw without the O_NONBLOCK flag
set, have the read block until there's data in the requested buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_pipe_raw never implemented polling and this was casing
issues for several utilities. This is now implemented.
Blocked reads still are on the TODO list.
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently only the splice NONBLOCK flag is checked to determine if
the splice read should block or not. But the file descriptor NONBLOCK
flag also needs to be checked.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The names used to display the field and type in the event format
files are copied, as well as the system name that is displayed.
All these names are created by constant values passed in.
If one of theses values were to be removed by a module, the module
would also be required to remove any event it created.
By using the strings directly, we can save over 100K of memory.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event structures used by the trace events are mostly persistent,
but they are also allocated by kmalloc, which is not the best at
allocating space for what is used. By converting these kmallocs
into kmem_cache_allocs, we can save over 50K of space that is
permanently allocated.
After boot we have:
slab name active allocated size
--------- ------ --------- ----
ftrace_event_file 979 1005 56 67 1
ftrace_event_field 2301 2310 48 77 1
The ftrace_event_file has at boot up 979 active objects out of
1005 allocated in the slabs. Each object is 56 bytes. In a normal
kmalloc, that would allocate 64 bytes for each object.
1005 - 979 = 26 objects not used
26 * 56 = 1456 bytes wasted
But if we used kmalloc:
64 - 56 = 8 bytes unused per allocation
8 * 979 = 7832 bytes wasted
7832 - 1456 = 6376 bytes in savings
Doing the same for ftrace_event_field where there's 2301 objects
allocated in a slab that can hold 2310 with 48 bytes each we have:
2310 - 2301 = 9 objects not used
9 * 48 = 432 bytes wasted
A kmalloc would also use 64 bytes per object:
64 - 48 = 16 bytes unused per allocation
16 * 2301 = 36816 bytes wasted!
36816 - 432 = 36384 bytes in savings
This change gives us a total of 42760 bytes in savings. At least
on my machine, but as there's a lot of these persistent objects
for all configurations that use trace points, this is a net win.
Thanks to Ezequiel Garcia for his trace_analyze presentation which
pointed out the wasted space in my code.
Cc: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new descriptors used to allow multiple buffers in the
tracing directory added, the kernel command line parameter
trace_events=... no longer works. This is because the top level
(global) trace array now has a list of descriptors associated
with the events and the files in the debugfs directory. But in
early bootup, when the command line is processed and the events
enabled, the trace array list of events has not been set up yet.
Without the list of events in the trace array, the setting of
events to record will fail because it would not match any events.
The solution is to set up the top level array in two stages.
The first is to just add the ftrace file descriptors that just point
to the events. This will allow events to be enabled and start tracing.
The second stage is called after the filesystem is set up, and this
stage will create the debugfs event files and directories associated
with the trace array events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a method to the hijacked dentry descriptor of the
"instances" directory to allow for rmdir to remove an
instance of a multibuffer.
Example:
cd /debug/tracing/instances
mkdir hello
ls
hello/
rmdir hello
ls
Like the mkdir method, the i_mutex is dropped for the instances
directory. The instances directory is created at boot up and can
not be renamed or removed. The trace_types_lock mutex is used to
synchronize adding and removing of instances.
I've run several stress tests with different threads trying to
create and delete directories of the same name, and it has stood
up fine.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the interface ("instances" directory) to add multiple buffers
to ftrace. To create a new instance, simply do a mkdir in the
instances directory:
This will create a directory with the following:
# cd instances
# mkdir foo
# ls foo
buffer_size_kb free_buffer trace_clock trace_pipe
buffer_total_size_kb set_event trace_marker tracing_enabled
events/ trace trace_options tracing_on
Currently only events are able to be set, and there isn't a way
to delete a buffer when one is created (yet).
Note, the i_mutex lock is dropped from the parent "instances"
directory during the mkdir operation. As the "instances" directory
can not be renamed or deleted (created on boot), I do not see
any harm in dropping the lock. The creation of the sub directories
is protected by trace_types_lock mutex, which only lets one
instance get into the code path at a time. If two tasks try to
create or delete directories of the same name, only one will occur
and the other will fail with -EEXIST.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the syscall events record into the global buffer. But if
multiple buffers are in place, then we need to have syscall events
record in the proper buffers.
By adding descriptors to pass to the syscall event functions, the
syscall events can now record into the buffers that have been assigned
to them (one event may be applied to mulitple buffers).
This will allow tracing high volume syscalls along with seldom occurring
syscalls without losing the seldom syscall events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The global and max-tr currently use static per_cpu arrays for the CPU data
descriptors. But in order to get new allocated trace_arrays, they need to
be allocated per_cpu arrays. Instead of using the static arrays, switch
the global and max-tr to use allocated data.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pass the struct ftrace_event_file *ftrace_file to the
trace_event_buffer_lock_reserve() (new function that replaces the
trace_current_buffer_lock_reserver()).
The ftrace_file holds a pointer to the trace_array that is in use.
In the case of multiple buffers with different trace_arrays, this
allows different events to be recorded into different buffers.
Also fixed some of the stale comments in include/trace/ftrace.h
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The global_trace variable in kernel/trace/trace.c has been kept 'static' and
local to that file so that it would not be used too much outside of that
file. This has paid off, even though there were lots of changes to make
the trace_array structure more generic (not depending on global_trace).
Removal of a lot of direct usages of global_trace is needed to be able to
create more trace_arrays such that we can add multiple buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both RING_BUFFER_ALL_CPUS and TRACE_PIPE_ALL_CPU are defined as
-1 and used to say that all the ring buffers are to be modified
or read (instead of just a single cpu, which would be >= 0).
There's no reason to keep TRACE_PIPE_ALL_CPU as it is also started
to be used for more than what it was created for, and now that
the ring buffer code added a generic RING_BUFFER_ALL_CPUS define,
we can clean up the trace code to use that instead and remove
the TRACE_PIPE_ALL_CPU macro.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace events for ftrace are all defined via global variables.
The arrays of events and event systems are linked to a global list.
This prevents multiple users of the event system (what to enable and
what not to).
By adding descriptors to represent the event/file relation, as well
as to which trace_array descriptor they are associated with, allows
for more than one set of events to be defined. Once the trace events
files have a link between the trace event and the trace_array they
are associated with, we can create multiple trace_arrays that can
record separate events in separate buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latency tracers require the buffers to be in overwrite mode,
otherwise they get screwed up. Force the buffers to stay in overwrite
mode when latency tracers are enabled.
Added a flag_changed() method to the tracer structure to allow
the tracers to see what flags are being changed, and also be able
to prevent the change from happing.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Changing the overwrite mode for the ring buffer via the trace
option only sets the normal buffer. But the snapshot buffer could
swap with it, and then the snapshot would be in non overwrite mode
and the normal buffer would be in overwrite mode, even though the
option flag states otherwise.
Keep the two buffers overwrite modes in sync.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Seems that the tracer flags have never been protected from
synchronous writes. Luckily, admins don't usually modify the
tracing flags via two different tasks. But if scripts were to
be used to modify them, then they could get corrupted.
Move the trace_types_lock that protects against tracers changing
to also protect the flags being set.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The watchdog_disabled flag is a bit cryptic. However it's
usefulness is multifold. Uses are:
1. Check if smpboot_register_percpu_thread function passed.
2. Makes sure that user enables and disables the watchdog in
sequence i.e. enable watchdog->disable watchdog->enable watchdog
Unlike enable watchdog->enable watchdog which is wrong.
Signed-off-by: anish kumar <anish198519851985@gmail.com>
[small text cleanups]
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: chuansheng.liu@intel.com
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1363113848-18344-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the recent locking updates, the only thing protected by
workqueue_lock is workqueue->maydays list. Rename workqueue_lock to
wq_mayday_lock.
This patch is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch continues locking cleanup from the previous patch. It
breaks out pool_workqueue synchronization from workqueue_lock into a
new spinlock - pwq_lock. The followings are protected by pwq_lock.
* workqueue->pwqs
* workqueue->saved_max_active
The conversion is straight-forward. workqueue_lock usages which cover
the above two are converted to pwq_lock. New locking label PW added
for things protected by pwq_lock and FR is updated to mean flush_mutex
+ pwq_lock + sched-RCU.
This patch shouldn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, workqueue_lock protects most shared workqueue resources -
the pools, workqueues, pool_workqueues, draining, ID assignments,
mayday handling and so on. The coverage has grown organically and
there is no identified bottleneck coming from workqueue_lock, but it
has grown a bit too much and scheduled rebinding changes need the
pools and workqueues to be protected by a mutex instead of a spinlock.
This patch breaks out pool and workqueue synchronization from
workqueue_lock into a new mutex - wq_mutex. The followings are
protected by wq_mutex.
* worker_pool_idr and unbound_pool_hash
* pool->refcnt
* workqueues list
* workqueue->flags, ->nr_drainers
Most changes are mostly straight-forward. workqueue_lock is replaced
with wq_mutex where applicable and workqueue_lock lock/unlocks are
added where wq_mutex conversion leaves data structures not protected
by wq_mutex without locking. irq / preemption flippings were added
where the conversion affects them. Things worth noting are
* New WQ and WR locking lables added along with
assert_rcu_or_wq_mutex().
* worker_pool_assign_id() now expects to be called under wq_mutex.
* create_mutex is removed from get_unbound_pool(). It now just holds
wq_mutex.
This patch shouldn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
When a manager creates or destroys workers, the operations are always
done with the manager_mutex held; however, initial worker creation or
worker destruction during pool release don't grab the mutex. They are
still correct as initial worker creation doesn't require
synchronization and grabbing manager_arb provides enough exclusion for
pool release path.
Still, let's make everyone follow the same rules for consistency and
such that lockdep annotations can be added.
Update create_and_start_worker() and put_unbound_pool() to grab
manager_mutex around thread creation and destruction respectively and
add lockdep assertions to create_worker() and destroy_worker().
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
get_unbound_pool(), workqueue_cpu_up_callback() and init_workqueues()
have similar code pieces to create and start the initial worker factor
those out into create_and_start_worker().
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Manager operations are currently governed by two mutexes -
pool->manager_arb and ->assoc_mutex. The former is used to decide who
gets to be the manager and the latter to exclude the actual manager
operations including creation and destruction of workers. Anyone who
grabs ->manager_arb must perform manager role; otherwise, the pool
might stall.
Grabbing ->assoc_mutex blocks everyone else from performing manager
operations but doesn't require the holder to perform manager duties as
it's merely blocking manager operations without becoming the manager.
Because the blocking was necessary when [dis]associating per-cpu
workqueues during CPU hotplug events, the latter was named
assoc_mutex. The mutex is scheduled to be used for other purposes, so
this patch gives it a more fitting generic name - manager_mutex - and
updates / adds comments to explain synchronization around the manager
role and operations.
This patch is pure rename / doc update.
Signed-off-by: Tejun Heo <tj@kernel.org>
There's no reason to make these trivial wrappers full (exported)
functions. Inline the followings.
queue_work()
queue_delayed_work()
mod_delayed_work()
schedule_work_on()
schedule_work()
schedule_delayed_work_on()
schedule_delayed_work()
keventd_up()
Signed-off-by: Tejun Heo <tj@kernel.org>
Rename @id argument of for_each_pool() to @pi so that it doesn't get
reused accidentally when for_each_pool() is used in combination with
other iterators.
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
* Update incorrect and add missing synchronization labels.
* Update incorrect or misleading comments. Add new comments where
clarification is necessary. Reformat / rephrase some comments.
* drain_workqueue() can be used separately from destroy_workqueue()
but its warning message was incorrectly referring to destruction.
Other than the warning message change, this patch doesn't make any
functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Since 9e8cd2f589 ("workqueue: implement apply_workqueue_attrs()"),
init_and_link_pwq() may be called to initialize a new pool_workqueue
for a workqueue which is already online, but the function was setting
pwq->max_active to wq->saved_max_active without proper
synchronization.
Fix it by calling pwq_adjust_max_active() under proper locking instead
of manually setting max_active.
Signed-off-by: Tejun Heo <tj@kernel.org>
Rename pwq_set_max_active() to pwq_adjust_max_active() and move
pool_workqueue->max_active synchronization and max_active
determination logic into it.
The new function should be called with workqueue_lock held for stable
workqueue->saved_max_active, determines the current max_active value
the target pool_workqueue should be using from @wq->saved_max_active
and the state of the associated pool, and applies it with proper
synchronization.
The current two users - workqueue_set_max_active() and
thaw_workqueues() - are updated accordingly. In addition, the manual
freezing handling in __alloc_workqueue_key() and
freeze_workqueues_begin() are replaced with calls to
pwq_adjust_max_active().
This centralizes max_active handling so that it's less error-prone.
Signed-off-by: Tejun Heo <tj@kernel.org>
pwq_set_max_active() is gonna be modified and used during
pool_workqueue init. Move it above init_and_link_pwq().
This patch is pure code reorganization and doesn't introduce any
functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge misc fixes from Andrew Morton:
- A bunch of fixes
- Finish off the idr API conversions before someone starts to use the
old interfaces again.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
idr: idr_alloc() shouldn't trigger lowmem warning when preloaded
UAPI: fix endianness conditionals in M32R's asm/stat.h
UAPI: fix endianness conditionals in linux/raid/md_p.h
UAPI: fix endianness conditionals in linux/acct.h
UAPI: fix endianness conditionals in linux/aio_abi.h
decompressors: fix typo "POWERPC"
mm/fremap.c: fix oops on error path
idr: deprecate idr_pre_get() and idr_get_new[_above]()
tidspbridge: convert to idr_alloc()
zcache: convert to idr_alloc()
mlx4: remove leftover idr_pre_get() call
workqueue: convert to idr_alloc()
nfsd: convert to idr_alloc()
nfsd: remove unused get_new_stid()
kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER
signal: always clear sa_restorer on execve
mm: remove_memory(): fix end_pfn setting
include/linux/res_counter.h needs errno.h
idr_get_new*() and friends are about to be deprecated. Convert to the
new idr_alloc() interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__ARCH_HAS_SA_RESTORER is the preferred conditional for use in 3.9 and
later kernels, per Kees.
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the new signal handlers are set up, the location of sa_restorer is
not cleared, leaking a parent process's address space location to
children. This allows for a potential bypass of the parent's ASLR by
examining the sa_restorer value returned when calling sigaction().
Based on what should be considered "secret" about addresses, it only
matters across the exec not the fork (since the VMAs haven't changed
until the exec). But since exec sets SIG_DFL and keeps sa_restorer,
this is where it should be fixed.
Given the few uses of sa_restorer, a "set" function was not written
since this would be the only use. Instead, we use
__ARCH_HAS_SA_RESTORER, as already done in other places.
Example of the leak before applying this patch:
$ cat /proc/$$/maps
...
7fb9f3083000-7fb9f3238000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
...
$ ./leak
...
7f278bc74000-7f278be29000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
...
1 0 (nil) 0x7fb9f30b94a0
2 4000000 (nil) 0x7f278bcaa4a0
3 4000000 (nil) 0x7f278bcaa4a0
4 0 (nil) 0x7fb9f30b94a0
...
[akpm@linux-foundation.org: use SA_RESTORER for backportability]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Emese Revfy <re.emese@gmail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Julien Tinnes <jln@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't allowing sharing the root directory with processes in a
different user namespace. There doesn't seem to be any point, and to
allow it would require the overhead of putting a user namespace
reference in fs_struct (for permission checks) and incrementing that
reference count on practically every call to fork.
So just perform the inexpensive test of forbidding sharing fs_struct
acrosss processes in different user namespaces. We already disallow
other forms of threading when unsharing a user namespace so this
should be no real burden in practice.
This updates setns, clone, and unshare to disallow multiple user
namespaces sharing an fs_struct.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because function tracing is very invasive, and can even trace
calls to rcu_read_lock(), RCU access in function tracing is done
with preempt_disable_notrace(). This requires a synchronize_sched()
for updates and not a synchronize_rcu().
Function probes (traceon, traceoff, etc) must be freed after
a synchronize_sched() after its entry has been removed from the
hash. But call_rcu() is used. Fix this by using call_rcu_sched().
Also fix the usage to use hlist_del_rcu() instead of hlist_del().
Cc: stable@vger.kernel.org
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In some situations, randomly delaying RCU grace-period initialization
can cause more trouble than help. This commit therefore restricts this
type of RCU self-torture to runtime, giving it a rest during boot and
shutdown.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If RCU's softirq handler is prevented from executing, an RCU CPU stall
warning can result. Ways to prevent RCU's softirq handler from executing
include: (1) CPU spinning with interrupts disabled, (2) infinite loop
in some softirq handler, and (3) in -rt kernels, an infinite loop in a
set of real-time threads running at priorities higher than that of RCU's
softirq handler.
Because this situation can be difficult to track down, this commit causes
the count of RCU softirq handler invocations to be printed with RCU
CPU stall warnings. This information does require some interpretation,
as now documented in Documentation/RCU/stallwarn.txt.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Some users have reported that after running a process with
hundreds of threads on intensive CPU-bound loads, the cputime
of the group started to freeze after a few days.
This is due to how we scale the tick-based cputime against
the scheduler precise execution time value.
We add the values of all threads in the group and we multiply
that against the sum of the scheduler exec runtime of the whole
group.
This easily overflows after a few days/weeks of execution.
A proposed solution to solve this was to compute that multiplication
on stime instead of utime:
62188451f0
("cputime: Avoid multiplication overflow on utime scaling")
The rationale behind that was that it's easy for a thread to
spend most of its time in userspace under intensive CPU-bound workload
but it's much harder to do CPU-bound intensive long run in the kernel.
This postulate got defeated when a user recently reported he was still
seeing cputime freezes after the above patch. The workload that
triggers this issue relates to intensive networking workloads where
most of the cputime is consumed in the kernel.
To reduce much more the opportunities for multiplication overflow,
lets reduce the multiplication factors to the remainders of the division
between sched exec runtime and cputime. Assuming the difference between
these shouldn't ever be that large, it could work on many situations.
This gets the same results as in the upstream scaling code except for
a small difference: the upstream code always rounds the results to
the nearest integer not greater to what would be the precise result.
The new code rounds to the nearest integer either greater or not
greater. In practice this difference probably shouldn't matter but
it's worth mentioning.
If this solution appears not to be enough in the end, we'll
need to partly revert back to the behaviour prior to commit
0cf55e1ec0
("sched, cputime: Introduce thread_group_times()")
Back then, the scaling was done on exit() time before adding the cputime
of an exiting thread to the signal struct. And then we'll need to
scale one-by-one the live threads cputime in thread_group_cputime(). The
drawback may be a slightly slower code on exit time.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
On the CPU which gets woken along with the target CPU of the broadcast
the following happens:
deep_idle()
<-- spurious wakeup
broadcast_exit()
set forced bit
enable interrupts
<-- Nothing happens
disable interrupts
broadcast_enter()
<-- Here we observe the forced bit is set
deep_idle()
Now after that the target CPU of the broadcast runs the broadcast
handler and finds the other CPU in both the broadcast and the forced
mask, sends the IPI and stuff gets back to normal.
So it's not actually harmful, just more evidence for the theory, that
hardware designers have access to very special drug supplies.
Now there is no point in going back to deep idle just to wake up again
right away via an IPI. Provide a check which allows the idle code to
avoid the deep idle transition.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Jason Liu <liu.h.jason@gmail.com>
Link: http://lkml.kernel.org/r/20130306111537.565418308@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some brilliant hardware implementations wake multiple cores when the
broadcast timer fires. This leads to the following interesting
problem:
CPU0 CPU1
wakeup from idle wakeup from idle
leave broadcast mode leave broadcast mode
restart per cpu timer restart per cpu timer
go back to idle
handle broadcast
(empty mask)
enter broadcast mode
programm broadcast device
enter broadcast mode
programm broadcast device
So what happens is that due to the forced reprogramming of the cpu
local timer, we need to set a event in the future. Now if we manage to
go back to idle before the timer fires, we switch off the timer and
arm the broadcast device with an already expired time (covered by
forced mode). So in the worst case we repeat the above ping pong
forever.
Unfortunately we have no information about what caused the wakeup, but
we can check current time against the expiry time of the local cpu. If
the local event is already in the past, we know that the broadcast
timer is about to fire and send an IPI. So we mark ourself as an IPI
target even if we left broadcast mode and avoid the reprogramming of
the local cpu timer.
This still leaves the possibility that a CPU which is not handling the
broadcast interrupt is going to reach idle again before the IPI
arrives. This can't be solved in the core code and will be handled in
follow up patches.
Reported-by: Jason Liu <liu.h.jason@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Link: http://lkml.kernel.org/r/20130306111537.492045206@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the local cpu timer stops in deep idle, we arm the broadcast device
and get woken by an IPI. Now when we return from deep idle we reenable
the local cpu timer unconditionally before handling the IPI. But
that's a pointless exercise: the timer is already expired and the IPI
is on the way. And it's an expensive exercise as we use the forced
reprogramming mode so that we do not lose a timer event. This forced
reprogramming will loop at least once in the retry.
To avoid this reprogramming, we mark the cpu in a pending bit mask
before we send the IPI. Now when the IPI target cpu wakes up, it will
see the pending bit set and skip the reprogramming. The reprogramming
of the cpu local timer will happen in the IPI handler which runs the
cpu local timer interrupt function.
Reported-by: Jason Liu <liu.h.jason@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Link: http://lkml.kernel.org/r/20130306111537.431082074@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return:
kernel-doc notation format.
Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi'
Fix one spello.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix new kernel-doc warnings in kernel/signal.c:
Warning(kernel/signal.c:2689): No description found for parameter 'uset'
Warning(kernel/signal.c:2689): Excess function parameter 'set' description in 'sys_rt_sigpending'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement a function which queries whether it currently is running off
a workqueue rescuer. This will be used to convert writeback to
workqueue.
Signed-off-by: Tejun Heo <tj@kernel.org>
When we open cgroup.procs, we'll allocate an buffer and store all tasks'
tgid in it, and then duplicate entries will be stripped. If that results
in a much smaller pid list, we'll re-allocate a smaller buffer.
But we've already sucessfully allocated memory and reading the procs
file is a short period and the memory will be freed very soon, so why
bother to re-allocate memory.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cpuset no longer nests cgroup_mutex inside cpu_hotplug lock, so
we don't have to release cgroup_mutex before calling css_offline().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Sasha reported a lockdep warning when OOM was triggered. The reason
is cgroup_name() should be called with rcu_read_lock() held.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Although it used to be that CPU_DYING notifiers executed on the outgoing
CPU with interrupts disabled and with all other CPUs spinning, this is
no longer the case. This commit therefore removes this obsolete comment.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Offline CPUs transition through the scheduler to the idle loop one
last time before being shut down. This can result in RCU raising
softirq on this CPU, which is at best useless given that the CPU's
callbacks will be offloaded at CPU_DEAD time. This commit therefore
avoids raising softirq on offline CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A function type is typically defined as
typedef ret_type (*func)(args..)
but async_func_ptr is not. Redefine it.
Also rename async_func_ptr to async_func_t for _func_t suffix is more generic.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
The code in lowest_in_progress() are duplicated in two branches,
simplify them.
tj: Minor indentation adjustment.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
There are cases where workqueue users want to expose control knobs to
userland. e.g. Unbound workqueues with custom attributes are
scheduled to be used for writeback workers and depending on
configuration it can be useful to allow admins to tinker with the
priority or allowed CPUs.
This patch implements workqueue_sysfs_register(), which makes the
workqueue visible under /sys/bus/workqueue/devices/WQ_NAME. There
currently are two attributes common to both per-cpu and unbound pools
and extra attributes for unbound pools including nice level and
cpumask.
If alloc_workqueue*() is called with WQ_SYSFS,
workqueue_sysfs_register() is called automatically as part of
workqueue creation. This is the preferred method unless the workqueue
user wants to apply workqueue_attrs before making the workqueue
visible to userland.
v2: Disallow exposing ordered workqueues as ordered workqueues can't
be tuned in any way.
Signed-off-by: Tejun Heo <tj@kernel.org>
Adjusting max_active of or applying new workqueue_attrs to an ordered
workqueue breaks its ordering guarantee. The former is obvious. The
latter is because applying attrs creates a new pwq (pool_workqueue)
and there is no ordering constraint between the old and new pwqs.
Make apply_workqueue_attrs() and workqueue_set_max_active() trigger
WARN_ON() if those operations are requested on an ordered workqueue
and fail / ignore respectively.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
We're gonna add another internal WQ flag. Let's make the distinction
clear. Prefix WQ_DRAINING with __ and move it to bit 16.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Implement apply_workqueue_attrs() which applies workqueue_attrs to the
specified unbound workqueue by creating a new pwq (pool_workqueue)
linked to worker_pool with the specified attributes.
A new pwq is linked at the head of wq->pwqs instead of tail and
__queue_work() verifies that the first unbound pwq has positive refcnt
before choosing it for the actual queueing. This is to cover the case
where creation of a new pwq races with queueing. As base ref on a pwq
won't be dropped without making another pwq the first one,
__queue_work() is guaranteed to make progress and not add work item to
a dead pwq.
init_and_link_pwq() is updated to return the last first pwq the new
pwq replaced, which is put by apply_workqueue_attrs().
Note that apply_workqueue_attrs() is almost identical to unbound pwq
part of alloc_and_link_pwqs(). The only difference is that there is
no previous first pwq. apply_workqueue_attrs() is implemented to
handle such cases and replaces unbound pwq handling in
alloc_and_link_pwqs().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Because per-cpu workqueues have multiple pwqs (pool_workqueues) to
serve the CPUs, to guarantee that a single work item isn't queued on
one pwq while still executing another, __queue_work() takes a look at
the previous pool the target work item was on and if it's still
executing there, queue the work item on that pool.
To support changing workqueue_attrs on the fly, unbound workqueues too
will have multiple pwqs and thus need non-reentrancy test when
queueing. This patch modifies __queue_work() such that the reentrancy
test is performed regardless of the workqueue type.
per_cpu_ptr(wq->cpu_pwqs, cpu) used to be used to determine the
matching pwq for the last pool. This can't be used for unbound
workqueues and is replaced with worker->current_pwq which also happens
to be simpler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Unbound pwqs (pool_workqueues) will be dynamically created and
destroyed with the scheduled unbound workqueue w/ custom attributes
support. This patch synchronizes pwq linking and unlinking against
flush_workqueue() so that its operation isn't disturbed by pwqs coming
and going.
Linking and unlinking a pwq into wq->pwqs is now protected also by
wq->flush_mutex and a new pwq's work_color is initialized to
wq->work_color during linking. This ensures that pwqs changes don't
disturb flush_workqueue() in progress and the new pwq's work coloring
stays in sync with the rest of the workqueue.
flush_mutex during unlinking isn't strictly necessary but it's simpler
to do it anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Add pool_workqueue->refcnt along with get/put_pwq(). Both per-cpu and
unbound pwqs have refcnts and any work item inserted on a pwq
increments the refcnt which is dropped when the work item finishes.
For per-cpu pwqs the base ref is never dropped and destroy_workqueue()
frees the pwqs as before. For unbound ones, destroy_workqueue()
simply drops the base ref on the first pwq. When the refcnt reaches
zero, pwq_unbound_release_workfn() is scheduled on system_wq, which
unlinks the pwq, puts the associated pool and frees the pwq and wq as
necessary. This needs to be done from a work item as put_pwq() needs
to be protected by pool->lock but release can't happen with the lock
held - e.g. put_unbound_pool() involves blocking operations.
Unbound pool->locks are marked with lockdep subclas 1 as put_pwq()
will schedule the release work item on system_wq while holding the
unbound pool's lock and triggers recursive locking warning spuriously.
This will be used to implement dynamic creation and destruction of
unbound pwqs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
* Move initialization and linking of pool_workqueues into
init_and_link_pwq().
* Make the failure path use destroy_workqueue() once pool_workqueue
initialization succeeds.
These changes are to prepare for dynamic management of pool_workqueues
and don't introduce any functional changes.
While at it, convert list_del(&wq->list) to list_del_init() as a
precaution as scheduled changes will make destruction more complex.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
WQ_RESCUER is superflous. WQ_MEM_RECLAIM indicates that the user
wants a rescuer and testing wq->rescuer for NULL can answer whether a
given workqueue has a rescuer or not. Drop WQ_RESCUER and test
wq->rescuer directly.
This will help simplifying __alloc_workqueue_key() failure path by
allowing it to use destroy_workqueue() on a partially constructed
workqueue, which in turn will help implementing dynamic management of
pool_workqueues.
While at it, clear wq->rescuer after freeing it in
destroy_workqueue(). This is a precaution as scheduled changes will
make destruction more complex.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
There are gonna be multiple unbound pools. Include pool ID in the
name of unbound kworkers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
All per-cpu pools are standard, so there's no need to use both "cpu"
and "std" and for_each_std_worker_pool() is confusing in that it can
be used only for per-cpu pools.
* s/cpu_std_worker_pools/cpu_worker_pools/
* s/for_each_std_worker_pool()/for_each_cpu_worker_pool()/
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Workqueue no longer makes use of unbound_std_worker_pools[]. All
unbound worker_pools are created dynamically and there's nothing
special about the standard ones. With unbound_std_worker_pools[]
unused, workqueue no longer has places where it needs to treat the
per-cpu pools-cpu and unbound pools together.
Remove unbound_std_worker_pools[] and the helpers wrapping it to
present unified per-cpu and unbound standard worker_pools.
* for_each_std_worker_pool() now only walks through per-cpu pools.
* for_each[_online]_wq_cpu() which don't have any users left are
removed.
* std_worker_pools() and std_worker_pool_pri() are unused and removed.
* get_std_worker_pool() is removed. Its only user -
alloc_and_link_pwqs() - only used it for per-cpu pools anyway. Open
code per_cpu access in alloc_and_link_pwqs() instead.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
This patch makes unbound worker_pools reference counted and
dynamically created and destroyed as workqueues needing them come and
go. All unbound worker_pools are hashed on unbound_pool_hash which is
keyed by the content of worker_pool->attrs.
When an unbound workqueue is allocated, get_unbound_pool() is called
with the attributes of the workqueue. If there already is a matching
worker_pool, the reference count is bumped and the pool is returned.
If not, a new worker_pool with matching attributes is created and
returned.
When an unbound workqueue is destroyed, put_unbound_pool() is called
which decrements the reference count of the associated worker_pool.
If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU
safe way.
Note that the standard unbound worker_pools - normal and highpri ones
with no specific cpumask affinity - are no longer created explicitly
during init_workqueues(). init_workqueues() only initializes
workqueue_attrs to be used for standard unbound pools -
unbound_std_wq_attrs[]. The pools are spawned on demand as workqueues
are created.
v2: - Comment added to init_worker_pool() explaining that @pool should
be in a condition which can be passed to put_unbound_pool() even
on failure.
- pool->refcnt reaching zero and the pool being removed from
unbound_pool_hash should be dynamic. pool->refcnt is converted
to int from atomic_t and now manipulated inside workqueue_lock.
- Removed an incorrect sanity check on nr_idle in
put_unbound_pool() which may trigger spuriously.
All changes were suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Introduce struct workqueue_attrs which carries worker attributes -
currently the nice level and allowed cpumask along with helper
routines alloc_workqueue_attrs() and free_workqueue_attrs().
Each worker_pool now carries ->attrs describing the attributes of its
workers. All functions dealing with cpumask and nice level of workers
are updated to follow worker_pool->attrs instead of determining them
from other characteristics of the worker_pool, and init_workqueues()
is updated to set worker_pool->attrs appropriately for all standard
pools.
Note that create_worker() is updated to always perform set_user_nice()
and use set_cpus_allowed_ptr() combined with manual assertion of
PF_THREAD_BOUND instead of kthread_bind(). This simplifies handling
random attributes without affecting the outcome.
This patch doesn't introduce any behavior changes.
v2: Missing cpumask_var_t definition caused build failure on some
archs. linux/cpumask.h included.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
This will be used to implement unbound pools with custom attributes.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
POOL_MANAGING_WORKERS is used to synchronize the manager role.
Synchronizing among workers doesn't need blocking and that's why it's
implemented as a flag.
It got converted to a mutex a while back to add blocking wait from CPU
hotplug path - 6037315269 ("workqueue: use mutex for global_cwq
manager exclusion"). Later it turned out that synchronization among
workers and cpu hotplug need to be done separately. Eventually,
POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got
morphed into workqueue->assoc_mutex - 552a37e936 ("workqueue: restore
POOL_MANAGING_WORKERS") and b2eb83d123 ("workqueue: rename
manager_mutex to assoc_mutex").
Now, we're gonna need to be able to lock out managers from
destroy_workqueue() to support multiple unbound pools with custom
attributes making it again necessary to be able to block on the
manager role. This patch replaces POOL_MANAGING_WORKERS with
worker_pool->manager_arb.
This patch doesn't introduce any behavior changes.
v2: s/manager_mutex/manager_arb/
Signed-off-by: Tejun Heo <tj@kernel.org>
Make worker_pool_idr protected by workqueue_lock for writes and
sched-RCU protected for reads. Lockdep assertions are added to
for_each_pool() and get_work_pool() and all their users are converted
to either hold workqueue_lock or disable preemption/irq.
worker_pool_assign_id() is updated to hold workqueue_lock when
allocating a pool ID. As idr_get_new() always performs RCU-safe
assignment, this is enough on the writer side.
As standard pools are never destroyed, there's nothing to do on that
side.
The locking is superflous at this point. This is to help
implementation of unbound pools/pwqs with custom attributes.
This patch doesn't introduce any behavior changes.
v2: Updated for_each_pwq() use if/else for the hidden assertion
statement instead of just if as suggested by Lai. This avoids
confusing the following else clause.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Make workqueue->pwqs protected by workqueue_lock for writes and
sched-RCU protected for reads. Lockdep assertions are added to
for_each_pwq() and first_pwq() and all their users are converted to
either hold workqueue_lock or disable preemption/irq.
alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for
consistency which isn't strictly necessary as the workqueue isn't
visible. destroy_workqueue() isn't updated to sched-RCU release pwqs.
This is okay as the workqueue should have on users left by that point.
The locking is superflous at this point. This is to help
implementation of unbound pools/pwqs with custom attributes.
This patch doesn't introduce any behavior changes.
v2: Updated for_each_pwq() use if/else for the hidden assertion
statement instead of just if as suggested by Lai. This avoids
confusing the following else clause.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
get_pwq() takes @cpu, which can also be WORK_CPU_UNBOUND, and @wq and
returns the matching pwq (pool_workqueue). We want to move away from
using @cpu for identifying pools and pwqs for unbound pools with
custom attributes and there is only one user - workqueue_congested() -
which makes use of the WQ_UNBOUND conditional in get_pwq(). All other
users already know whether they're dealing with a per-cpu or unbound
workqueue.
Replace get_pwq() with explicit per_cpu_ptr(wq->cpu_pwqs, cpu) for
per-cpu workqueues and first_pwq() for unbound ones, and open-code
WQ_UNBOUND conditional in workqueue_congested().
Note that this makes workqueue_congested() behave sligntly differently
when @cpu other than WORK_CPU_UNBOUND is specified. It ignores @cpu
for unbound workqueues and always uses the first pwq instead of
oopsing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
workqueue->pool_wq union is used to point either to percpu pwqs
(pool_workqueues) or single unbound pwq. As the first pwq can be
accessed via workqueue->pwqs list, there's no reason for the single
pointer anymore.
Use list_first_entry(workqueue->pwqs) to access the unbound pwq and
drop workqueue->pool_wq.single pointer and the pool_wq union. It
simplifies the code and eases implementing multiple unbound pools w/
custom attributes.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Workqueue is mixing unsigned int and int for @cpu variables. There's
no point in using unsigned int for cpus - many of cpu related APIs
take int anyway. Consistently use int for @cpu variables so that we
can use negative values to mark special ones.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Similar to how pool_workqueue iteration used to be, raising and
servicing mayday requests is based on CPU numbers. It's hairy because
cpumask_t may not be able to handle WORK_CPU_UNBOUND and cpumasks are
assumed to be always set on UP. This is ugly and can't handle
multiple unbound pools to be added for unbound workqueues w/ custom
attributes.
Add workqueue_struct->maydays. When a pool_workqueue needs rescuing,
it gets chained on the list through pool_workqueue->mayday_node and
rescuer_thread() consumes the list until it's empty.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The three freeze/thaw related functions - freeze_workqueues_begin(),
freeze_workqueues_busy() and thaw_workqueues() - need to iterate
through all pool_workqueues of all freezable workqueues. They did it
by first iterating pools and then visiting all pwqs (pool_workqueues)
of all workqueues and process it if its pwq->pool matches the current
pool. This is rather backwards and done this way partly because
workqueue didn't have fitting iteration helpers and partly to avoid
the number of lock operations on pool->lock.
Workqueue now has fitting iterators and the locking operation overhead
isn't anything to worry about - those locks are unlikely to be
contended and the same CPU visiting the same set of locks multiple
times isn't expensive.
Restructure the three functions such that the flow better matches the
logical steps and pwq iteration is done using for_each_pwq() inside
workqueue iteration.
* freeze_workqueues_begin(): Setting of FREEZING is moved into a
separate for_each_pool() iteration. pwq iteration for clearing
max_active is updated as described above.
* freeze_workqueues_busy(): pwq iteration updated as described above.
* thaw_workqueues(): The single for_each_wq_cpu() iteration is broken
into three discrete steps - clearing FREEZING, restoring max_active,
and kicking workers. The first and last steps use for_each_pool()
and the second step uses pwq iteration described above.
This makes the code easier to understand and removes the use of
for_each_wq_cpu() for walking pwqs, which can't support multiple
unbound pwqs which will be needed to implement unbound workqueues with
custom attributes.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
With the scheduled unbound pools with custom attributes, there will be
multiple unbound pools, so it wouldn't be able to use
for_each_wq_cpu() + for_each_std_worker_pool() to iterate through all
pools.
Introduce for_each_pool() which iterates through all pools using
worker_pool_idr and use it instead of for_each_wq_cpu() +
for_each_std_worker_pool() combination in freeze_workqueues_begin().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Introduce for_each_pwq() which iterates all pool_workqueues of a
workqueue using the recently added workqueue->pwqs list and replace
for_each_pwq_cpu() usages with it.
This is primarily to remove the single unbound CPU assumption from pwq
iteration for the scheduled unbound pools with custom attributes
support which would introduce multiple unbound pwqs per workqueue;
however, it also simplifies iterator users.
Note that pwq->pool initialization is moved to alloc_and_link_pwqs()
as that now is the only place which is explicitly handling the two pwq
types.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Add workqueue_struct->pwqs list and chain all pool_workqueues
belonging to a workqueue there. This will be used to implement
generic pool_workqueue iteration and handle multiple pool_workqueues
for the scheduled unbound pools with custom attributes.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
pool_workqueues need to be aligned to 1 << WORK_STRUCT_FLAG_BITS as
the lower bits of work->data are used for flags when they're pointing
to pool_workqueues.
Due to historical reasons, unbound pool_workqueues are allocated using
kzalloc() with sufficient buffer area for alignment and aligned
manually. The original pointer is stored at the end which free_pwqs()
retrieves when freeing it.
There's no reason for this hackery anymore. Set alignment of struct
pool_workqueue to 1 << WORK_STRUCT_FLAG_BITS, add kmem_cache for
pool_workqueues with proper alignment and replace the hacky alloc and
free implementation with plain kmem_cache_zalloc/free().
In case WORK_STRUCT_FLAG_BITS gets shrunk too much and makes fields of
pool_workqueues misaligned, trigger WARN if the alignment of struct
pool_workqueue becomes smaller than that of long long.
Note that assertion on IS_ALIGNED() is removed from alloc_pwqs(). We
already have another one in pwq init loop in __alloc_workqueue_key().
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
workqueue_lock will be used to synchronize areas which require
irq-safety and there isn't much benefit in keeping it not irq-safe.
Make it irq-safe.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Workqueue has been using mostly BUG_ON()s for sanity checks, which
fail unnecessarily harshly when the assertion doesn't hold. Most
assertions can converted to be less drastic such that things can limp
along instead of dying completely. Convert BUG_ON()s to
WARN_ON[_ONCE]()s with softer failure behaviors - e.g. if assertion
check fails in destroy_worker(), trigger WARN and silently ignore
destruction request.
Most conversions are trivial. Note that sanity checks in
destroy_workqueue() are moved above removal from workqueues list so
that it can bail out without side-effects if assertion checks fail.
This patch doesn't introduce any visible behavior changes during
normal operation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, CPU 0 is constrained to not be a no-CBs CPU, and furthermore
at least one no-CBs CPU must remain online at any given time. These
restrictions are problematic in some situations, such as cases where
all CPUs must run a real-time workload that needs to be insulated from
OS jitter and latencies due to RCU callback invocation. This commit
therefore provides no-CBs CPUs a (very crude and energy-inefficient)
way to start and to wait for grace periods independently of the normal
RCU callback mechanisms. This approach allows any or all of the CPUs to
be designated as no-CBs CPUs, and allows any proper subset of the CPUs
(whether no-CBs CPUs or not) to be offlined.
This commit also provides a fix for a locking bug spotted by Xie
ChanglongX <changlongx.xie@intel.com>.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although the swap is wrapped with a spin_lock, the assignment
of the temp buffer used to swap is not within that lock.
It needs to be moved into that lock, otherwise two swaps
happening on two different CPUs, can end up using the wrong
temp buffer to assign in the swap.
Luckily, all current callers of the swap function appear to have
their own locks. But in case something is added that allows two
different callers to call the swap, then there's a chance that
this race can trigger and corrupt the buffers.
New code is coming soon that will allow for this race to trigger.
I've Cc'd stable, so this bug will not show up if someone backports
one of the changes that can trigger this bug.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Misc minor fixes mostly related to tracing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
s390: Fix a header dependencies related build error
tracing: update documentation of snapshot utility
tracing: Do not return EINVAL in snapshot when not allocated
tracing: Add help of snapshot feature when snapshot is empty
ftrace: Update the kconfig for DYNAMIC_FTRACE
All warnings:
In file included from kernel/sched/core.c:85:0:
kernel/sched/sched.h:1036:39: warning: 'struct sched_domain' declared inside parameter list
kernel/sched/sched.h:1036:39: warning: its scope is only this definition or declaration, which is probably not what you want
It's because struct sched_domain is defined inside #if CONFIG_SMP,
while update_group_power() is declared unconditionally.
Fix this warning by declaring update_group_power() only if
CONFIG_SMP=n.
Build tested with CONFIG_SMP enabled and then disabled.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5137F4BA.2060101@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since multiple pools per cpu have been introduced, wq_unbind_fn() has
a subtle bug which may theoretically stall work item processing. The
problem is two-fold.
* wq_unbind_fn() depends on the worker executing wq_unbind_fn() itself
to start unbound chain execution, which works fine when there was
only single pool. With multiple pools, only the pool which is
running wq_unbind_fn() - the highpri one - is guaranteed to have
such kick-off. The other pool could stall when its busy workers
block.
* The current code is setting WORKER_UNBIND / POOL_DISASSOCIATED of
the two pools in succession without initiating work execution
inbetween. Because setting the flags requires grabbing assoc_mutex
which is held while new workers are created, this could lead to
stalls if a pool's manager is waiting for the previous pool's work
items to release memory. This is almost purely theoretical tho.
Update wq_unbind_fn() such that it sets WORKER_UNBIND /
POOL_DISASSOCIATED, goes over schedule() and explicitly kicks off
execution for a pool and then moves on to the next one.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Commit b67bfe0d42 ("hlist: drop the node parameter from iterators")
did a lot of nice changes but also contains two small hunks that seem to
have slipped in accidentally and have no apparent connection to the
intent of the patch.
This reverts the two extraneous changes.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Senna Tschudin <peter.senna@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently tick_check_broadcast_device doesn't reject clock_event_devices
with CLOCK_EVT_FEAT_DUMMY, and may select them in preference to real
hardware if they have a higher rating value. In this situation, the
dummy timer is responsible for broadcasting to itself, and the core
clockevents code may attempt to call non-existent callbacks for
programming the dummy, eventually leading to a panic.
This patch makes tick_check_broadcast_device always reject dummy timers,
preventing this problem.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Jon Medhurst (Tixy) <tixy@linaro.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The full dynticks cputime accounting is able to account either
using the tick or the context tracking subsystem. This way
the housekeeping CPU can keep the low overhead tick based
solution.
This latter mode has a low jiffies resolution granularity and
need to be scaled against CFS precise runtime accounting to
improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING,
now we also need to expand it to full dynticks accounting dynamic
off-case as well.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Mats Liljegren <mats.liljegren@enea.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
From the context tracking POV, preempt_schedule_irq() behaves pretty much
like an exception: It can be called anytime and schedule another task.
But currently it doesn't restore the context tracking state of the preempted
code on preempt_schedule_irq() return.
As a result, if preempt_schedule_irq() is called in the tiny frame between
user_enter() and the actual return to userspace, we resume userspace with
the wrong context tracking state.
Fix this by using exception_enter/exit() which are a perfect fit for this
kind of issue.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Mats Liljegren <mats.liljegren@enea.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
To use the tracing snapshot feature, writing a '1' into the snapshot
file causes the snapshot buffer to be allocated if it has not already
been allocated and dose a 'swap' with the main buffer, so that the
snapshot now contains what was in the main buffer, and the main buffer
now writes to what was the snapshot buffer.
To free the snapshot buffer, a '0' is written into the snapshot file.
To clear the snapshot buffer, any number but a '0' or '1' is written
into the snapshot file. But if the file is not allocated it returns
-EINVAL error code. This is rather pointless. It is better just to
do nothing and return success.
Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When cat'ing the snapshot file, instead of showing an empty trace
header like the trace file does, show how to use the snapshot
feature.
Also, this is a good place to show if the snapshot has been allocated
or not. Users may want to "pre allocate" the snapshot to have a fast
"swap" of the current buffer. Otherwise, a swap would be slow and might
fail as it would need to allocate the snapshot buffer, and that might
fail under tight memory constraints.
Here's what it looked like before:
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
Here's what it looks like now:
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move struct perf_cgroup_info and perf_cgroup to
kernel/perf/core.c, and then we can remove include of cgroup.h.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/513568A0.6020804@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As default_scale_{freq,smt}_power() and update_rt_power() are
used in kernel/sched/fair.c only, annotate them as static
functions.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A7AF.8010900@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They are used internally only.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A78E.7040609@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move struct sched_group_power and sched_group and related inline
functions to kernel/sched/sched.h, as they are used internally
only.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They are used internally only.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A771.4070104@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull irq fixes and cleanups from Thomas Gleixner:
"Commit e5ab012c32 ("nohz: Make tick_nohz_irq_exit() irq safe") is
the first commit in the series and the minimal necessary bugfix, which
needs to go back into stable.
The remanining commits enforce irq disabling in irq_exit(), sanitize
the hardirq/softirq preempt count transition and remove a bunch of no
longer necessary conditionals."
I personally love getting rid of the very subtle and confusing
IRQ_EXIT_OFFSET thing. Even apart from the whole "more lines removed
than added" thing.
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irq: Don't re-enable interrupts at the end of irq_exit
irq: Remove IRQ_EXIT_OFFSET workaround
Revert "nohz: Make tick_nohz_irq_exit() irq safe"
irq: Sanitize invoke_softirq
irq: Ensure irq_exit() code runs with interrupts disabled
nohz: Make tick_nohz_irq_exit() irq safe
Pull smpboot bugfix from Thomas Gleixner:
"A single bugfix for a regression introduced with the conversion of the
stop machine threads to the generic smpboot thread management
facility"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine: Mark per cpu stopper enabled early
subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload,
and that's protected by cgroup_mutex, and then the memory *subsys[i]
resides will be freed.
So this is unsafe without any locking:
if (!ss || ss->module)
...
v2:
- add a comment for enum cgroup_subsys_id
- simplify the comment in cgroup_exit()
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We no longer fail rmdir() when there're still css refs, so we don't
need to check css refs in check_for_release().
This also voids a bug. cgroup_has_css_refs() accesses subsys[i]
without cgroup_mutex, so it can race with cgroup_unload_subsys().
cgroup_has_css_refs()
...
if (ss == NULL || ss->root != cgrp->root)
if ss pointers to net_cls_subsys, and cls_cgroup module is unloaded
right after the former check but before the latter, the memory that
net_cls_subsys resides has become invalid.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use cgroup_name() instead of cgrp->dentry->name. This makes the code
a bit simpler.
While at it, remove cpuset_name and make cpuset_nodelist a local variable
to cpuset_print_task_mems_allowed().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
rename() will change dentry->d_name. The result of this race can
be worse than seeing partially rewritten name, but we might access
a stale pointer because rename() will re-allocate memory to hold
a longer name.
As accessing dentry->name must be protected by dentry->d_lock or
parent inode's i_mutex, while on the other hand cgroup-path() can
be called with some irq-safe spinlocks held, we can't generate
cgroup path using dentry->d_name.
Alternatively we make a copy of dentry->d_name and save it in
cgrp->name when a cgroup is created, and update cgrp->name at
rename().
v5: use flexible array instead of zero-size array.
v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
- add cgroup_name() wrapper.
v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
v2: make cgrp->name RCU safe.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Rescuers visit different worker_pools to process work items from pools
under pressure. Currently, rescuer->pool is updated outside any
locking and when an outsider looks at a rescuer, there's no way to
tell when and whether rescuer->pool is gonna change. While this
doesn't currently cause any problem, it is nasty.
With recent worker_maybe_bind_and_lock() changes, we can move
rescuer->pool updates inside pool locks such that if rescuer->pool
equals a locked pool, it's guaranteed to stay that way until the pool
is unlocked.
Move rescuer->pool inside pool->lock.
This patch doesn't introduce any visible behavior difference.
tj: Updated the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_maybe_bind_and_lock() currently takes @worker but only cares
about @worker->pool. This patch updates worker_maybe_bind_and_lock()
to take @pool instead of @worker. This will be used to better define
synchronization rules regarding rescuer->pool updates.
This doesn't introduce any functional change.
tj: Updated the comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_maybe_bind_and_lock() uses both @worker->task and @current at
the same time. As worker_maybe_bind_and_lock() can only be called by
the current worker task, they are always the same.
Update worker_maybe_bind_and_lock() to use %current consistently.
This doesn't introduce any functional change.
tj: Massaged the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull more VFS bits from Al Viro:
"Unfortunately, it looks like xattr series will have to wait until the
next cycle ;-/
This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
more file_inode() work"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify path_get/path_put and fs_struct.c stuff
fix nommu breakage in shmem.c
cache the value of file_inode() in struct file
9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
9p: make sure ->lookup() adds fid to the right dentry
9p: untangle ->lookup() a bit
9p: double iput() in ->lookup() if d_materialise_unique() fails
9p: v9fs_fid_add() can't fail now
v9fs: get rid of v9fs_dentry
9p: turn fid->dlist into hlist
9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
more file_inode() open-coded instances
selinux: opened file can't have NULL or negative ->f_path.dentry
(In the meantime, the hlist traversal macros have changed, so this
required a semantic conflict fixup for the newly hlistified fid->dlist)
This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()
Signed-off-by: James Hogan <james.hogan@imgtec.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRMmVXAAoJEKHZs+irPybfivgP/inEXqJyfw59omQdjwvYcU/a
/u0MJ3UKSNS3U+HknfaFCy/Nwk1dqPLjqqyVC1V6AbUPBXlaEwGcimlNRx2uRjdq
Uh96upMLHsNuF/xiiR477g3RwY0egIJdM1R1bGi3mZ3vVrNQGF+wbni6f61xCWGz
M/4rDglpQvE79oLhYdgj6tidZtHQT0YWtERA9W90zkQWXGYmpFPKBKbfZAi5+rKQ
U6Gpg26orUugzXNaxltJEYKE8gjLTppEabx8DARnItZ4zCMy4dw5RBJ35RFvQw6e
eSmfgTy9w9WqBMY2+QMSgU0KQt1IITCzX7OlOXC0jALQJXoU0WWbOELlBVQLCwF1
T0OcR/5ZP/hIlOk5Dh+e9U3AtbASXdMtqA0ZUe78woH1CBf7Nc/0c0vRg23EdMh8
lnHDJxT/UqskoOYLI4kgWbEdLDy4uTh19U2pVi7VCo7ksLB9Bj9Xc8VSKgscSXTl
OwzN+c4Jgtu8FDFTp+Af4AT8pYGJ08j8L2ErsV2sOv3Q44U5WXdrMz3GSgwXj8+4
wZk3HvdkQVkMD5sJCUZgAswaN6BnbB0pHdCz4wMQ8jR/Ogs015Ipk64Ecym9S/4n
uES7PnDtt/4lb5EyX2ScbvdnZTAFTaaP7OOhC77BOQvbQjIW1tkAcxWJqRry86uS
iM0BFgK6Ohx3geqa5Ft0
=65cR
-----END PGP SIGNATURE-----
Merge tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag
Pull new ImgTec Meta architecture from James Hogan:
"This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of
metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()"
* tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: (61 commits)
metag: Provide dma_get_sgtable()
metag: prom.h: remove declaration of metag_dt_memblock_reserve()
metag: copy devicetree to non-init memory
metag: cleanup metag_ksyms.c includes
metag: move mm/init.c exports out of metag_ksyms.c
metag: move usercopy.c exports out of metag_ksyms.c
metag: move setup.c exports out of metag_ksyms.c
metag: move kick.c exports out of metag_ksyms.c
metag: move traps.c exports out of metag_ksyms.c
metag: move irq enable out of irqflags.h on SMP
genksyms: fix metag symbol prefix on crc symbols
metag: hugetlb: convert to vm_unmapped_area()
metag: export clear_page and copy_page
metag: export metag_code_cache_flush_all
metag: protect more non-MMU memory regions
metag: make TXPRIVEXT bits explicit
metag: kernel/setup.c: sort includes
perf: Enable building perf tools for Meta
metag: add boot time LNKGET/LNKSET check
metag: add __init to metag_cache_probe()
...
Pull sigprocmask compat fix from Al Viro:
"generic compat_sys_rt_sigprocmask() had a very dumb braino; I'd spent
quite a while staring at the offending commit before finally managing
to spot the idiocy ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
fix compat_sys_rt_sigprocmask()
Converting bitmask to 32bit granularity is fine, but we'd better
_do_ something with the result. Such as "copy it to userland"...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some 32 bit architectures require 64 bit values to be aligned (for
example Meta which has 64 bit read/write instructions). These require 8
byte alignment of event data too, so use
!CONFIG_HAVE_64BIT_ALIGNED_ACCESS instead of !CONFIG_64BIT ||
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to decide alignment, and align
buffer_data_page::data accordingly.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org> (previous version subtly different)
Cleanups
Remove kdb ssb command - there is no in kernel disassembler to support it
Remove kdb ll command - Always caused a kernel oops and there were no
bug reports so no one was using this command
Use kernel ARRAY_SIZE macro instead of array computations
Fixes
Stop oops in kdb if user executes kdb_defcmd with args
kdb help command truncated text
ppc64 support for kgdbts
Add missing kconfig option from original kdb port for dealing with
catastrophic kernel crashes such that you can reboot automatically
on continue from kdb
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRMhJPAAoJEIciOldedpOj25gP+wZn/u3YxOGkhL3CIBOegEwW
KFte7FFNeLNzw8tyEzBMVXok85PJ+/lvm+Xi+a4hhtV7/S+WvwGoWeIg0zCDtrAu
im4m2X1u00Egcztcr4tCk8Lwc6vNm5KZsBsUsbKKi0K8twaWifkxLMEWeq+3WLzh
+1d4TDkziwdZfcLpUHvohmCsCyg1EdjUxTYnWpSJvLrq5lhNvTQBowHWZH06GBcI
MmhC+aHp3myk2/Vwd1AuDq+jIRbH3ORV50wfRONBR7OJ58sOoD9nV8Mr+fy8QLRN
BPv8WxeRSm7X/Hl6nPm9PX7oNQSg+Ba1+5helIIJMdGGWhJbl9AZslFEU40Zn3yS
V7FmS3mwRSA8NCgfZ+CO6311tirSCvtf34+5ttXEntyK1ujaplSPffBP4Ya26y6q
TB2wLr91+3L0LhzrVRj20P13qexsWUW4t9inLzqtDApD3Zljl9Kuwd0febCiLg0r
mgGkpZjK0VI7S7RfkxpmEyU4bFP2lCTzBOKAKOfIV6O88bfqFdv5QCTHODJArf22
ugDGPX3tw++Lyc9otGxyPbxmc7npYGqTD1UEM8xK/7sEBiWJm71jjwHN1bwRwB3o
EC5bcZ6TJgDJnnvYGBTQec74PnUiCu/UnQVFFabzhowQCJTIWSJkPxjnI2BKK1V5
3exPqogclusAhhwopOtG
=Hwbq
-----END PGP SIGNATURE-----
Merge tag 'for_linux-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB fixes and cleanups from Jason Wessel:
"For a change we removed more code than we added. If people aren't
using it we shouldn't be carrying it. :-)
Cleanups:
- Remove kdb ssb command - there is no in kernel disassembler to
support it
- Remove kdb ll command - Always caused a kernel oops and there were
no bug reports so no one was using this command
- Use kernel ARRAY_SIZE macro instead of array computations
Fixes:
- Stop oops in kdb if user executes kdb_defcmd with args
- kdb help command truncated text
- ppc64 support for kgdbts
- Add missing kconfig option from original kdb port for dealing with
catastrophic kernel crashes such that you can reboot automatically
on continue from kdb"
* tag 'for_linux-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kdb: Remove unhandled ssb command
kdb: Prevent kernel oops with kdb_defcmd
kdb: Remove the ll command
kdb_main: fix help print
kdb: Fix overlap in buffers with strcpy
Fixed dead ifdef block by adding missing Kconfig option.
kdb: Setup basic kdb state before invoking commands via kgdb
kdb: use ARRAY_SIZE where possible
kgdb/kgdbts: support ppc64
kdb: A fix for kdb command table expansion
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRMZYbAAoJEGnX8d3iisJeFj8P/R1hWohDDUc8pG3+ov9Y2Brt
g7oIVw1udlKIk3HhVwsyT14/UHunfcTCONKKKGmmbfRrLJSMMsSlXvYoAQLozokf
TuaO3Xt5IfERROqTrCDSwdNaAmwIZGsIuI9jWKo4qsXovAL0nc3sR527qMI1o7OE
9X5eIqOJ/rPvOjXYrPqXmvO/DZKYG8PNTH4PYqePes31CsAmDXWBIlgYmgvF3th3
ptwKPgRU/c2wKNpDDJXVQg/bcg9NI2cCnndNrjgXZgyUQrC37ZTdt/IOF5w6FgIW
6i6UbDKXn8MgQAhrXx0Ns/+0kSJZ7eBWmj8hLyrxUzOYlF4rCs/il6ofDRaMO6fv
9LmbNZXYnGICzm1YAxZRK7dm13IbDnltmMc81vISBpJSMTBgqzLWobHnq5/67Wh4
2oUkoc2Tfaw70FnRCewX0x4Qop2YXmXl1KBwdecvzdcKi6Yg+rRH08ur/0yyCyx7
+vAQpPVIuVqCc916qwmCPFaf1UMNnmMStxNH7D1AQHvi1G372NxfXizdYyKFRY9N
f5Q+6DTo1xh2AxuGieSZxBoeK0Rlp4DWTOBD4MMz29y7BRX7LK1U2iS+nW0g8uir
3RdYeAqyCxlJtjJNQX9U8ZT54jUPZgvJWU0udesRN1CBdOSQMjM9OyZsLLRtLeX1
ww2tc7zqhBUyjBej6Itg
=NKkW
-----END PGP SIGNATURE-----
Merge tag 'arc-v3.9-rc1-late' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull new ARC architecture from Vineet Gupta:
"Initial ARC Linux port with some fixes on top for 3.9-rc1:
I would like to introduce the Linux port to ARC Processors (from
Synopsys) for 3.9-rc1. The patch-set has been discussed on the public
lists since Nov and has received a fair bit of review, specially from
Arnd, tglx, Al and other subsystem maintainers for DeviceTree, kgdb...
The arch bits are in arch/arc, some asm-generic changes (acked by
Arnd), a minor change to PARISC (acked by Helge).
The series is a touch bigger for a new port for 2 main reasons:
1. It enables a basic kernel in first sub-series and adds
ptrace/kgdb/.. later
2. Some of the fallout of review (DeviceTree support, multi-platform-
image support) were added on top of orig series, primarily to
record the revision history.
This updated pull request additionally contains
- fixes due to our GNU tools catching up with the new syscall/ptrace
ABI
- some (minor) cross-arch Kconfig updates."
* tag 'arc-v3.9-rc1-late' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (82 commits)
ARC: split elf.h into uapi and export it for userspace
ARC: Fixup the current ABI version
ARC: gdbserver using regset interface possibly broken
ARC: Kconfig cleanup tracking cross-arch Kconfig pruning in merge window
ARC: make a copy of flat DT
ARC: [plat-arcfpga] DT arc-uart bindings change: "baud" => "current-speed"
ARC: Ensure CONFIG_VIRT_TO_BUS is not enabled
ARC: Fix pt_orig_r8 access
ARC: [3.9] Fallout of hlist iterator update
ARC: 64bit RTSC timestamp hardware issue
ARC: Don't fiddle with non-existent caches
ARC: Add self to MAINTAINERS
ARC: Provide a default serial.h for uart drivers needing BASE_BAUD
ARC: [plat-arcfpga] defconfig for fully loaded ARC Linux
ARC: [Review] Multi-platform image #8: platform registers SMP callbacks
ARC: [Review] Multi-platform image #7: SMP common code to use callbacks
ARC: [Review] Multi-platform image #6: cpu-to-dma-addr optional
ARC: [Review] Multi-platform image #5: NR_IRQS defined by ARC core
ARC: [Review] Multi-platform image #4: Isolate platform headers
ARC: [Review] Multi-platform image #3: switch to board callback
...
The 'ssb' command can only be handled when we have a disassembler, to check for
branches, so remove the 'ssb' command for now.
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The kdb_defcmd can only be used to display the available command aliases
while using the kernel debug shell. If you try to define a new macro
while the kernel debugger is active it will oops. The debug shell
macros must use pre-allocated memory set aside at the time kdb_init()
is run, and the kdb_defcmd is restricted to only working at the time
that the kdb_init sequence is being run, which only occurs if you
actually activate the kernel debugger.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Recently some code inspection was done after fixing a problem with
kmalloc used while in the kernel debugger context (which is not
legal), and it turned up the fact that kdb ll command will oops the
kernel.
Given that there have been zero bug reports on the command combined
with the fact it will oops the kernel it is clearly not being used.
Instead of fixing it, it will be removed.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The help command was chopping all the usage instructions such that
they were not readable.
Example:
bta [D|R|S|T|C|Z|E|U|I| Backtrace all processes matching state flag
per_cpu <sym> [<bytes>] [<c Display per_cpu variables
Where as it should look like:
bta [D|R|S|T|C|Z|E|U|I|M|A]
Backtrace all processes matching state flag
per_cpu <sym> [<bytes>] [<cpu>]
Display per_cpu variables
All that is needed is to check the how long the cmd_usage is and jump
to the next line when appropriate.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Maxime reported that strcpy(s->usage, s->usage+1) has no definitive
guarantee that it will work on all archs the same way when you have
overlapping memory. The fix is simple for the kdb code because we
still have the original string memory in the function scope, so we
just have to use that as the argument instead.
Reported-by: Maxime Villard <rustyBSD@gmx.fr>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Although invasive kdb commands are not supported via kgdb, some useful
non-invasive commands like bt* require basic kdb state to be setup before
calling into the kdb code. Factor out some of this code and call it before
and after executing kdb commands via kgdb.
Signed-off-by: Matt Klein <mklein@twitter.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When locally adding in some additional kdb commands, I stumbled
across an issue with the dynamic expansion of the kdb command table.
When the number of kdb commands exceeds the size of the statically
allocated kdb_base_commands[] array, additional space is allocated in
the kdb_register_repeat() routine.
The unused portion of the newly allocated array was not being initialized
to zero properly and this would result in segfaults when help '?' was
executed or when a search for a non-existing command would traverse the
command table beyond the end of valid command entries and then attempt
to use the non-zeroed area as actual command entries.
Signed-off-by: John Blackwood <john.blackwood@ccur.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Pull thermal management updates from Zhang Rui:
"Highlights:
- introduction of Dove thermal sensor driver.
- introduction of Kirkwood thermal sensor driver.
- introduction of intel_powerclamp thermal cooling device driver.
- add interrupt and DT support for rcar thermal driver.
- add thermal emulation support which allows platform thermal driver
to do software/hardware emulation for thermal issues."
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (36 commits)
thermal: rcar: remove __devinitconst
thermal: return an error on failure to register thermal class
Thermal: rename thermal governor Kconfig option to avoid generic naming
thermal: exynos: Use the new thermal trend type for quick cooling action.
Thermal: exynos: Add support for temperature falling interrupt.
Thermal: Dove: Add Themal sensor support for Dove.
thermal: Add support for the thermal sensor on Kirkwood SoCs
thermal: rcar: add Device Tree support
thermal: rcar: remove machine_power_off() from rcar_thermal_notify()
thermal: rcar: add interrupt support
thermal: rcar: add read/write functions for common/priv data
thermal: rcar: multi channel support
thermal: rcar: use mutex lock instead of spin lock
thermal: rcar: enable CPCTL to use hardware TSC deciding
thermal: rcar: use parenthesis on macro
Thermal: fix a build warning when CONFIG_THERMAL_EMULATION cleared
Thermal: fix a wrong comment
thermal: sysfs: Add a new sysfs node emul_temp for thermal emulation
PM: intel_powerclamp: off by one in start_power_clamp()
thermal: exynos: Miscellaneous fixes to support falling threshold interrupt
...
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
Commit 74eed0163d
"irq: Ensure irq_exit() code runs with interrupts disabled"
restore interrupts flags in the end of irq_exit() for archs
that don't define __ARCH_IRQ_EXIT_IRQS_DISABLED.
However always returning from irq_exit() with interrupts
disabled should not be a problem for these archs. Prior to
this commit this was already happening anytime we processed
pending softirqs anyway.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Merge third patch-bumb from Andrew Morton:
"This wraps me up for -rc1.
- Lots of misc stuff and things which were deferred/missed from
patchbombings 1 & 2.
- ocfs2 things
- lib/scatterlist
- hfsplus
- fatfs
- documentation
- signals
- procfs
- lockdep
- coredump
- seqfile core
- kexec
- Tejun's large IDR tree reworkings
- ipmi
- partitions
- nbd
- random() things
- kfifo
- tools/testing/selftests updates
- Sasha's large and pointless hlist cleanup"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (163 commits)
hlist: drop the node parameter from iterators
kcmp: make it depend on CHECKPOINT_RESTORE
selftests: add a simple doc
tools/testing/selftests/Makefile: rearrange targets
selftests/efivarfs: add create-read test
selftests/efivarfs: add empty file creation test
selftests: add tests for efivarfs
kfifo: fix kfifo_alloc() and kfifo_init()
kfifo: move kfifo.c from kernel/ to lib/
arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS
w1: add support for DS2413 Dual Channel Addressable Switch
memstick: move the dereference below the NULL test
drivers/pps/clients/pps-gpio.c: use devm_kzalloc
Documentation/DMA-API-HOWTO.txt: fix typo
include/linux/eventfd.h: fix incorrect filename is a comment
mtd: mtd_stresstest: use prandom_bytes()
mtd: mtd_subpagetest: convert to use prandom library
mtd: mtd_speedtest: use prandom_bytes
mtd: mtd_pagetest: convert to use prandom library
mtd: mtd_oobtest: convert to use prandom library
...
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since kcmp syscall has been implemented (initially on x86 architecture) a
number of other archs wire it up as well: xtensa, sparc, sh, s390, mips,
microblaze, m68k (not taking into account those who uses
<asm-generic/unistd.h> for syscall numbers definitions).
But the Makefile, which turns kcmp.o generation on still depends on former
config-x86. Thus get rid of this limitation and make kcmp.o depend on
CHECKPOINT_RESTORE option.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the wrong comment about the return value of clone_uts_ns()
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Put get/get_uts() into CONFIG_PROC_SYSCTL code block as they are used
only when CONFIG_PROC_SYSCTL is enabled.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The null check of `strchr() + 1' is broken, which is always non-null,
leading to OOB read. Instead, check the result of strchr().
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to the much saner new idr interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to the much saner new idr interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to the much saner new idr interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated. Drop its usage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Though there is no error if we free a NULL pointer, I think we could
avoid this behaviour. Change the code a little in kimage_crash_alloc()
could avoid this kind of unnecessary free.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kimage_normal_alloc() fails to alloc pages for image->swap_page, it
should call kimage_free_page_list() to free allocated pages in
image->control_pages list before it frees image.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kimage_normal_alloc() fails to initialize an allocated kimage, it will
free the image but would still set 'rimage', as a result kexec_load will
try to free it again.
This would explode as part of the freeing process is accessing internal
members which point to uninitialized memory.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch exports a PG_hwpoison into vmcoreinfo when
CONFIG_MEMORY_FAILURE is defined. "makedumpfile" needs to read
information of memory, such as 'mem_section', 'zone', 'pageflags' from
vmcore.
We introduce a function into "makedumpfile" to exclude hwpoison page from
vmcore dump. In order to introduce this function, PG_hwpoison flag have
to export into vmcoreinfo.
Signed-off-by: Mitsuhiro Tanino <mitsuhiro.tanino.gm@hitachi.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Mitsuhiro Tanino <mitsuhiro.tanino.gm@hitachi.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hole_end has been checked to make sure it is <= crash_res.end in the while
condition check, so the if condition check is duplicate.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tAdd adds the values related to buddy system to vmcoreinfo data so that
makedumpfile (dump filtering command) can filter out all free pages with
the new logic.
It's faster than the current logic because it can distinguish free page
by analyzing page structure at the same time as filtering for other
unnecessary pages (e.g. anonymous page).
OTOH, the current logic has to trace free_list to distinguish free pages
while analyzing page structure to filter out other unnecessary pages.
The new logic uses the fact that buddy page is marked by _mapcount ==
PAGE_BUDDY_MAPCOUNT_VALUE. But, _mapcount shares its memory with other
fields for SLAB/SLUB when PG_slab is set, so we need to check if PG_slab
is set or not before looking up _mapcount value. And we can get the
order of buddy system from private field. To sum it up, the values
below are required for this logic.
Required values:
- OFFSET(page._mapcount)
- OFFSET(page.private)
- NUMBER(PG_slab)
- NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)
Changelog from v1 to v2:
1. remove SIZE(pageflags)
The new logic was changed after I sent v1 patch.
Accordingly, SIZE(pageflags) has been unnecessary for makedumpfile.
What's makedumpfile:
makedumpfile creates a small dumpfile by excluding unnecessary pages
for the analysis. To distinguish unnecessary pages, makedumpfile gets
the vmcoreinfo data which has the minimum debugging information only
for dump filtering.
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If new_nsproxy is set we will always call switch_task_namespaces and
then set new_nsproxy back to NULL so the reassignment and fall through
check are redundant
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prevents hung_task detector from panicing the machine. This is also
needed to prevent this wait from blocking suspend.
(It doesnt' currently block suspend but it would once the next
patch in this series is applied.)
[yongjun_wei@trendmicro.com.cn: kernel/exit.c: remove duplicated include]
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ben Chan <benchan@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g. by dpm). Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.
[akpm@linux-foundation.org: export debug_check_no_locks_held]
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ben Chan <benchan@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
defines introduced in 54b501992d ("coredump: warn about unsafe
suid_dumpable / core_pattern combo"). Remove the new ones, and use the
prior values instead.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Chen Gang <gang.chen@asianux.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several printk's were missing KERN_INFO and KERN_CONT flags. In
addition, a printk that was outside a #if/#endif should have been
inside, which would result in stray blank line on non-x86 boxes.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea is simple. We need to get the siginfo for each signal on
checkpointing dump, and then return it back on restore.
The first problem is that the kernel doesn't report complete siginfos to
userspace. In a signal handler the kernel strips SI_CODE from siginfo.
When a siginfo is received from signalfd, it has a different format with
fixed sizes of fields. The interface of signalfd was extended. If a
signalfd is created with the flag SFD_RAW, it returns siginfo in a raw
format.
rt_sigqueueinfo looks suitable for restoring signals, but it can't send
siginfo with a positive si_code, because these codes are reserved for
the kernel. In the real world each person has right to do anything with
himself, so I think a process should able to send any siginfo to itself.
This patch:
The kernel prevents sending of siginfo with positive si_code, because
these codes are reserved for kernel. I think we can allow a task to
send such a siginfo to itself. This operation should not be dangerous.
This functionality is required for restoring signals in
checkpoint/restart.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__orderly_poweroff() does argv_free() if call_usermodehelper_fns()
returns -ENOMEM. As Lucas pointed out, this can be wrong if -ENOMEM was
not triggered by the failing call_usermodehelper_setup(), in this case
both __orderly_poweroff() and argv_cleanup() can do kfree().
Kill argv_cleanup() and change __orderly_poweroff() to call argv_free()
unconditionally like do_coredump() does. This info->cleanup() is not
needed (and wrong) since 6c0c0d4d "fix bug in orderly_poweroff() which
did the UMH_NO_WAIT => UMH_WAIT_EXEC change, we can rely on the fact
that CLONE_VFORK can't return until do_execve() succeeds/fails.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: hongfeng <hongfeng@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prompt to enable DYNAMIC_FTRACE (the ability to nop and
enable function tracing at run time) had a confusing statement:
"enable/disable ftrace tracepoints dynamically"
This was written before tracepoints were added to the kernel,
but now that tracepoints have been added, this is very confusing
and has confused people enough to give wrong information during
presentations.
Not only that, I looked at the help text, and it still references
that dreaded daemon that use to wake up once a second to update
the nop locations and brick NICs, that hasn't been around for over
five years.
Time to bring the text up to the current decade.
Cc: stable@vger.kernel.org
Reported-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull kbuild changes from Michal Marek:
- Alias generation in modpost is cross-compile safe.
- kernel/timeconst.h is now generated using a bc script instead of
perl.
- scripts/link-vmlinux.sh now works with an alternative
$KCONFIG_CONFIG.
- destination-y for exported headers is supported in Kbuild files
again.
- depmod is called with -P $CONFIG_SYMBOL_PREFIX on architectures that
need it.
- CONFIG_DEBUG_INFO_REDUCED disables var-tracking
- scripts/setlocalversion works with too much translated locales ;)
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kbuild: Fix reading of .config in link-vmlinux.sh
kbuild: Unset language specific variables in setlocalversion script
Kbuild: Disable var tracking with CONFIG_DEBUG_INFO_REDUCED
depmod: pass -P $CONFIG_SYMBOL_PREFIX
kbuild: Fix destination-y for installed headers
scripts/link-vmlinux.sh: source variables from KCONFIG_CONFIG
kernel: Replace timeconst.pl with a bc script
mod/file2alias: make modalias generation safe for cross compiling
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cputime: Use local_clock() for full dynticks cputime accounting
cputime: Constify timeval_to_cputime(timeval) argument
sched: Move RR_TIMESLICE from sysctl.h to rt.h
sched: Fix /proc/sched_debug failure on very very large systems
sched: Fix /proc/sched_stat failure on very very large systems
sched/core: Remove the obsolete and unused nr_uninterruptible() function
commit 14e568e78 (stop_machine: Use smpboot threads) introduced the
following regression:
Before this commit the stopper enabled bit was set in the online
notifier.
CPU0 CPU1
cpu_up
cpu online
hotplug_notifier(ONLINE)
stopper(CPU1)->enabled = true;
...
stop_machine()
The conversion to smpboot threads moved the enablement to the wakeup
path of the parked thread. The majority of users seem to have the
following working order:
CPU0 CPU1
cpu_up
cpu online
unpark_threads()
wakeup(stopper[CPU1])
....
stopper thread runs
stopper(CPU1)->enabled = true;
stop_machine()
But Konrad and Sander have observed:
CPU0 CPU1
cpu_up
cpu online
unpark_threads()
wakeup(stopper[CPU1])
....
stop_machine()
stopper thread runs
stopper(CPU1)->enabled = true;
Now the stop machinery kicks CPU0 into the stop loop, where it gets
stuck forever because the queue code saw stopper(CPU1)->enabled ==
false, so CPU0 waits for CPU1 to enter stomp_machine, but the CPU1
stopper work got discarded due to enabled == false.
Add a pre_unpark function to the smpboot thread descriptor and call it
before waking the thread.
This fixes the problem at hand, but the stop_machine code should be
more robust. The stopper->enabled flag smells fishy at best.
Thanks to Konrad for going through a loop of debug patches and
providing the information to decode this issue.
Reported-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302261843240.22263@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull drm merge from Dave Airlie:
"Highlights:
- TI LCD controller KMS driver
- TI OMAP KMS driver merged from staging
- drop gma500 stub driver
- the fbcon locking fixes
- the vgacon dirty like zebra fix.
- open firmware videomode and hdmi common code helpers
- major locking rework for kms object handling - pageflip/cursor
won't block on polling anymore!
- fbcon helper and prime helper cleanups
- i915: all over the map, haswell power well enhancements, valleyview
macro horrors cleaned up, killing lots of legacy GTT code,
- radeon: CS ioctl unification, deprecated UMS support, gpu reset
rework, VM fixes
- nouveau: reworked thermal code, external dp/tmds encoder support
(anx9805), fences sleep instead of polling,
- exynos: all over the driver fixes."
Lovely conflict in radeon/evergreen_cs.c between commit de0babd60d
("drm/radeon: enforce use of radeon_get_ib_value when reading user cmd")
and the new changes that modified that evergreen_dma_cs_parse()
function.
* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (508 commits)
drm/tilcdc: only build on arm
drm/i915: Revert hdmi HDP pin checks
drm/tegra: Add list of framebuffers to debugfs
drm/tegra: Fix color expansion
drm/tegra: Split DC_CMD_STATE_CONTROL register write
drm/tegra: Implement page-flipping support
drm/tegra: Implement VBLANK support
drm/tegra: Implement .mode_set_base()
drm/tegra: Add plane support
drm/tegra: Remove bogus tegra_framebuffer structure
drm: Add consistency check for page-flipping
drm/radeon: Use generic HDMI infoframe helpers
drm/tegra: Use generic HDMI infoframe helpers
drm: Add EDID helper documentation
drm: Add HDMI infoframe helpers
video: Add generic HDMI infoframe helpers
drm: Add some missing forward declarations
drm: Move mode tables to drm_edid.c
drm: Remove duplicate drm_mode_cea_vic()
gma500: Fix n, m1 and m2 clock limits for sdvo and lvds
...
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
lockdep, but it's a mechanical change.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf
4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf
3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6
eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee
ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn
4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC
Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1
AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO
32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv
lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO
rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK
+GszxsFkCjlW0mK0egTb
=tiSY
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"The sweeping change is to make add_taint() explicitly indicate whether
to disable lockdep, but it's a mechanical change."
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Add option to not sign modules during modules_install
MODSIGN: Add -s <signature> option to sign-file
MODSIGN: Specify the hash algorithm on sign-file command line
MODSIGN: Simplify Makefile with a Kconfig helper
module: clean up load_module a little more.
modpost: Ignore ARC specific non-alloc sections
module: constify within_module_*
taint: add explicit flag to show whether lock dep is still OK.
module: printk message when module signature fail taints kernel.
Running the full dynticks cputime accounting with preemptible
kernel debugging trigger the following warning:
[ 4.488303] BUG: using smp_processor_id() in preemptible [00000000] code: init/1
[ 4.490971] caller is native_sched_clock+0x22/0x80
[ 4.493663] Pid: 1, comm: init Not tainted 3.8.0+ #13
[ 4.496376] Call Trace:
[ 4.498996] [<ffffffff813410eb>] debug_smp_processor_id+0xdb/0xf0
[ 4.501716] [<ffffffff8101e642>] native_sched_clock+0x22/0x80
[ 4.504434] [<ffffffff8101db99>] sched_clock+0x9/0x10
[ 4.507185] [<ffffffff81096ccd>] fetch_task_cputime+0xad/0x120
[ 4.509916] [<ffffffff81096dd5>] task_cputime+0x35/0x60
[ 4.512622] [<ffffffff810f146e>] acct_update_integrals+0x1e/0x40
[ 4.515372] [<ffffffff8117d2cf>] do_execve_common+0x4ff/0x5c0
[ 4.518117] [<ffffffff8117cf14>] ? do_execve_common+0x144/0x5c0
[ 4.520844] [<ffffffff81867a10>] ? rest_init+0x160/0x160
[ 4.523554] [<ffffffff8117d457>] do_execve+0x37/0x40
[ 4.526276] [<ffffffff810021a3>] run_init_process+0x23/0x30
[ 4.528953] [<ffffffff81867aac>] kernel_init+0x9c/0xf0
[ 4.531608] [<ffffffff8188356c>] ret_from_fork+0x7c/0xb0
We use sched_clock() to perform and fixup the cputime
accounting. However we are calling it with preemption enabled
from the read side, which trigger the bug above.
To fix this up, use local_clock() instead. It takes care of
preemption and also provide a more reliable clock source. This
is welcome for this kind of statistic that is widely relied on
in userspace.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kevin Hilman <khilman@linaro.org>
Link: http://lkml.kernel.org/r/1361636925-22288-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull signal handling cleanups from Al Viro:
"This is the first pile; another one will come a bit later and will
contain SYSCALL_DEFINE-related patches.
- a bunch of signal-related syscalls (both native and compat)
unified.
- a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE
(fixing several potential problems with missing argument
validation, while we are at it)
- a lot of now-pointless wrappers killed
- a couple of architectures (cris and hexagon) forgot to save
altstack settings into sigframe, even though they used the
(uninitialized) values in sigreturn; fixed.
- microblaze fixes for delivery of multiple signals arriving at once
- saner set of helpers for signal delivery introduced, several
architectures switched to using those."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits)
x86: convert to ksignal
sparc: convert to ksignal
arm: switch to struct ksignal * passing
alpha: pass k_sigaction and siginfo_t using ksignal pointer
burying unused conditionals
make do_sigaltstack() static
arm64: switch to generic old sigaction() (compat-only)
arm64: switch to generic compat rt_sigaction()
arm64: switch compat to generic old sigsuspend
arm64: switch to generic compat rt_sigqueueinfo()
arm64: switch to generic compat rt_sigpending()
arm64: switch to generic compat rt_sigprocmask()
arm64: switch to generic sigaltstack
sparc: switch to generic old sigsuspend
sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE
sparc: kill sign-extending wrappers for native syscalls
kill sparc32_open()
sparc: switch to use of generic old sigaction
sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE
mips: switch to generic sys_fork() and sys_clone()
...
Merge second patch-bomb from Andrew Morton:
- A little DM fix
- the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits)
ksm: allocate roots when needed
mm: cleanup "swapcache" in do_swap_page
mm,ksm: swapoff might need to copy
mm,ksm: FOLL_MIGRATION do migration_entry_wait
ksm: shrink 32-bit rmap_item back to 32 bytes
ksm: treat unstable nid like in stable tree
ksm: add some comments
tmpfs: fix mempolicy object leaks
tmpfs: fix use-after-free of mempolicy object
mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
mm: export mmu notifier invalidates
mm: accelerate mm_populate() treatment of THP pages
mm: use long type for page counts in mm_populate() and get_user_pages()
mm: accurately document nr_free_*_pages functions with code comments
HWPOISON: change order of error_states[]'s elements
HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
memcg: stop warning on memcg_propagate_kmem
net: change type of virtio_chan->p9_max_pages
vmscan: change type of vm_total_pages to unsigned long
fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
...
When calculating amount of dirtyable memory, min_free_kbytes should be
subtracted because it is not intended for dirty pages.
Addresses http://bugs.debian.org/695182
[akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
[akpm@linux-foundation.org: fix min() warning]
Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
will return -1. As a result, cpumask_of_node(nid) will return NULL. In
this case, find_next_bit() in for_each_cpu will get a NULL pointer and
cause panic.
Here is a call trace:
Call Trace:
<IRQ>
select_fallback_rq+0x71/0x190
try_to_wake_up+0x2cb/0x2f0
wake_up_process+0x15/0x20
hrtimer_wakeup+0x22/0x30
__run_hrtimer+0x83/0x320
hrtimer_interrupt+0x106/0x280
smp_apic_timer_interrupt+0x69/0x99
apic_timer_interrupt+0x6f/0x80
There is a hrtimer process sleeping, whose cpu has already been
offlined. When it is waken up, it tries to find another cpu to run, and
get a -1 nid. As a result, cpumask_of_node(-1) returns NULL, and causes
ernel panic.
This patch fixes this problem by judging if the nid is -1. If nid is
not -1, a cpu on the same node will be picked. Else, a online cpu on
another node will be picked.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
On systems with 4096 cores attemping to read /proc/sched_debug
fails because we are trying to push all the data into a single
kmalloc buffer.
The issue is on these very large machines all the data will not
fit in 4mb.
A better solution is to not us the single_open mechanism but to
provide our own seq_operations and treat each cpu as an
individual record.
The output should be identical to the previous version.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>)
[ Whitespace fixlet]
[ Fix spello in comment]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On systems with 4096 cores doing a cat /proc/sched_stat fails,
because we are trying to push all the data into a single kmalloc
buffer.
The issue is on these very large machines all the data will not
fit in 4mb.
A better solution is to not use the single_open() mechanism but
to provide our own seq_operations.
The output should be identical to previous version and thus not
need the version number.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
[ Fix memleak]
[ Fix spello in comment]
[ Fix warnings]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 mm changes from Peter Anvin:
"This is a huge set of several partly interrelated (and concurrently
developed) changes, which is why the branch history is messier than
one would like.
The *really* big items are two humonguous patchsets mostly developed
by Yinghai Lu at my request, which completely revamps the way we
create initial page tables. In particular, rather than estimating how
much memory we will need for page tables and then build them into that
memory -- a calculation that has shown to be incredibly fragile -- we
now build them (on 64 bits) with the aid of a "pseudo-linear mode" --
a #PF handler which creates temporary page tables on demand.
This has several advantages:
1. It makes it much easier to support things that need access to data
very early (a followon patchset uses this to load microcode way
early in the kernel startup).
2. It allows the kernel and all the kernel data objects to be invoked
from above the 4 GB limit. This allows kdump to work on very large
systems.
3. It greatly reduces the difference between Xen and native (Xen's
equivalent of the #PF handler are the temporary page tables created
by the domain builder), eliminating a bunch of fragile hooks.
The patch series also gets us a bit closer to W^X.
Additional work in this pull is the 64-bit get_user() work which you
were also involved with, and a bunch of cleanups/speedups to
__phys_addr()/__pa()."
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits)
x86, mm: Move reserving low memory later in initialization
x86, doc: Clarify the use of asm("%edx") in uaccess.h
x86, mm: Redesign get_user with a __builtin_choose_expr hack
x86: Be consistent with data size in getuser.S
x86, mm: Use a bitfield to mask nuisance get_user() warnings
x86/kvm: Fix compile warning in kvm_register_steal_time()
x86-32: Add support for 64bit get_user()
x86-32, mm: Remove reference to alloc_remap()
x86-32, mm: Remove reference to resume_map_numa_kva()
x86-32, mm: Rip out x86_32 NUMA remapping code
x86/numa: Use __pa_nodebug() instead
x86: Don't panic if can not alloc buffer for swiotlb
mm: Add alloc_bootmem_low_pages_nopanic()
x86, 64bit, mm: hibernate use generic mapping_init
x86, 64bit, mm: Mark data/bss/brk to nx
x86: Merge early kernel reserve for 32bit and 64bit
x86: Add Crash kernel low reservation
x86, kdump: Remove crashkernel range find limit for 64bit
memblock: Add memblock_mem_size()
x86, boot: Not need to check setup_header version for setup_data
...
Merge misc patches from Andrew Morton:
- Florian has vanished so I appear to have become fbdev maintainer
again :(
- Joel and Mark are distracted to welcome to the new OCFS2 maintainer
- The backlight queue
- Small core kernel changes
- lib/ updates
- The rtc queue
- Various random bits
* akpm: (164 commits)
rtc: rtc-davinci: use devm_*() functions
rtc: rtc-max8997: use devm_request_threaded_irq()
rtc: rtc-max8907: use devm_request_threaded_irq()
rtc: rtc-da9052: use devm_request_threaded_irq()
rtc: rtc-wm831x: use devm_request_threaded_irq()
rtc: rtc-tps80031: use devm_request_threaded_irq()
rtc: rtc-lp8788: use devm_request_threaded_irq()
rtc: rtc-coh901331: use devm_clk_get()
rtc: rtc-vt8500: use devm_*() functions
rtc: rtc-tps6586x: use devm_request_threaded_irq()
rtc: rtc-imxdi: use devm_clk_get()
rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug()
rtc: rtc-pcf8583: use dev_warn() instead of printk()
rtc: rtc-sun4v: use pr_warn() instead of printk()
rtc: rtc-vr41xx: use dev_info() instead of printk()
rtc: rtc-rs5c313: use pr_err() instead of printk()
rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug()
rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug()
rtc: rtc-ds2404: use dev_err() instead of printk()
rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk()
...
We can use user_ns, which is also assigned from task_cred_xxx(tsk,
user_ns), at the beginning of copy_namespaces().
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove a tabstop from the switch statement, in the usual fashion. A few
instances of weirdwrapping were removed as a result.
Cc: Chen Gang <gang.chen@asianux.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arg2 will never < 0, for its type is 'unsigned long'
Also, use the provided macros.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm testing swapout workload in a two-socket Xeon machine. The workload
has 10 threads, each thread sequentially accesses separate memory
region. TLB flush overhead is very big in the workload. For each page,
page reclaim need move it from active lru list and then unmap it. Both
need a TLB flush. And this is a multthread workload, TLB flush happens
in 10 CPUs. In X86, TLB flush uses generic smp_call)function. So this
workload stress smp_call_function_many heavily.
Without patch, perf shows:
+ 24.49% [k] generic_smp_call_function_interrupt
- 21.72% [k] _raw_spin_lock
- _raw_spin_lock
+ 79.80% __page_check_address
+ 6.42% generic_smp_call_function_interrupt
+ 3.31% get_swap_page
+ 2.37% free_pcppages_bulk
+ 1.75% handle_pte_fault
+ 1.54% put_super
+ 1.41% grab_super_passive
+ 1.36% __swap_duplicate
+ 0.68% blk_flush_plug_list
+ 0.62% swap_info_get
+ 6.55% [k] flush_tlb_func
+ 6.46% [k] smp_call_function_many
+ 5.09% [k] call_function_interrupt
+ 4.75% [k] default_send_IPI_mask_sequence_phys
+ 2.18% [k] find_next_bit
swapout throughput is around 1300M/s.
With the patch, perf shows:
- 27.23% [k] _raw_spin_lock
- _raw_spin_lock
+ 80.53% __page_check_address
+ 8.39% generic_smp_call_function_single_interrupt
+ 2.44% get_swap_page
+ 1.76% free_pcppages_bulk
+ 1.40% handle_pte_fault
+ 1.15% __swap_duplicate
+ 1.05% put_super
+ 0.98% grab_super_passive
+ 0.86% blk_flush_plug_list
+ 0.57% swap_info_get
+ 8.25% [k] default_send_IPI_mask_sequence_phys
+ 7.55% [k] call_function_interrupt
+ 7.47% [k] smp_call_function_many
+ 7.25% [k] flush_tlb_func
+ 3.81% [k] _raw_spin_lock_irqsave
+ 3.78% [k] generic_smp_call_function_single_interrupt
swapout throughput is around 1400M/s. So there is around a 7%
improvement, and total cpu utilization doesn't change.
Without the patch, cfd_data is shared by all CPUs.
generic_smp_call_function_interrupt does read/write cfd_data several times
which will create a lot of cache ping-pong. With the patch, the data
becomes per-cpu. The ping-pong is avoided. And from the perf data, this
doesn't make call_single_queue lock contend.
Next step is to remove generic_smp_call_function_interrupt() from arch
code.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
How is the compiler even handling exported functions that are marked
inline? Anyway, these shouldn't be inline because of that, so remove
that marking.
Based on a larger patch by Mark Charlebois to get LLVM to build the
kernel.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Charlebois <mcharleb@qualcomm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: hank <pyu@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The copy_to_user() call returns the number of bytes remaining but we
want to return -EFAULT on error.
Fixes "x32: fix waitid()"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The IRQ_EXIT_OFFSET trick was used to make sure the irq
doesn't get preempted after we substract the HARDIRQ_OFFSET
until we are entirely done with any code in irq_exit().
This workaround was necessary because some archs may call
irq_exit() with irqs enabled and there is still some code
in the end of this function that is not covered by the
HARDIRQ_OFFSET but want to stay non-preemptible.
Now that irq are always disabled in irq_exit(), the whole code
is guaranteed not to be preempted. We can thus remove this hack.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A large number of cleanups, all over the platforms. This is dominated
largely by the Samsung platforms (s3c, s5p, exynos) and a few of the
others moving code out of arch/arm into more appropriate subsystems.
The clocksource and irqchip drivers are now abstracted to the point
where platforms that are already cleaned up do not need to even specify
the driver they use, it can all get configured from the device tree
as we do for normal device drivers. The clocksource changes basically
touch every single platform in the process.
We further clean up the use of platform specific header files here,
with the goal of turning more of the platforms over to being
"multiplatform" enabled, which implies that they cannot expose
their headers to architecture independent code any more.
It is expected that no functional changes are part of the cleanup.
The overall reduction in total code lines is mostly the result of
removing broken and obsolete code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAUSUyKmCrR//JCVInAQIN8RAAnb/uPytmlMjn5yCksF4Mvb/FVbn/TVwz
KRIGpCHOzyKK1q7pM8NRUVWfjW2SZqbXJFqx6zBGKSlDPvFTOhsLyyupU+Tnyu5W
IX4eIUBwb+a6H7XDHw0X2YI8uHzi5RNLhne0A1QyDKcnuHs1LDAttXnJHaK4Ap6Y
NN2YFt3l3ld7DXWXJtMsw5v8lC10aeIFGTvXefaPDAdeMLivmI57qEUMDXknNr7W
Odz/Rc0/cw3BNBVl/zNHA0jw7FOjKAymCYYNUa4xDCJEr+JnIRTqizd0N/YIIC7x
aA2xjJ3oKUFyF51yiJE6nFuTyJznhwtehc+uiMOSIkjrPLym52LEHmd7G5Yqlmjz
oiei09qBb870q3lGxwfht9iaeIwYgQFYGfD0yW5QWArCO5pxhtCPLPH7YZNZtcQd
ZJRSGGqT/ljBz3bm0K9OLESeeTTN7+Nxvtpiz/CD+Piegz0gWJzDYJRTzkJ3UWpA
WTVhVQdWUeX2JrNkgM7Z3Tu8iXOe+LIEs7kVXGJZSREmIIZiRvR36UrODZtAkp9I
7YQ+srX/uaR832pgK0RrHK0zY0psU6MmIvhYxJZFbx7keiPA9eH6drb0x7tGqcUD
FzEUzvcZvyqppndfBi+R60H/YKAhJDEXdwxzo6dyCpPQaW1T9GnzIqXuE1zin+Aw
X7Y8YywMbHI=
=DvgJ
-----END PGP SIGNATURE-----
Merge tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC cleanups from Arnd Bergmann:
"A large number of cleanups, all over the platforms. This is dominated
largely by the Samsung platforms (s3c, s5p, exynos) and a few of the
others moving code out of arch/arm into more appropriate subsystems.
The clocksource and irqchip drivers are now abstracted to the point
where platforms that are already cleaned up do not need to even
specify the driver they use, it can all get configured from the device
tree as we do for normal device drivers. The clocksource changes
basically touch every single platform in the process.
We further clean up the use of platform specific header files here,
with the goal of turning more of the platforms over to being
"multiplatform" enabled, which implies that they cannot expose their
headers to architecture independent code any more.
It is expected that no functional changes are part of the cleanup.
The overall reduction in total code lines is mostly the result of
removing broken and obsolete code."
* tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (133 commits)
ARM: mvebu: correct gated clock documentation
ARM: kirkwood: add missing include for nsa310
ARM: exynos: move exynos4210-combiner to drivers/irqchip
mfd: db8500-prcmu: update resource passing
drivers/db8500-cpufreq: delete dangling include
ARM: at91: remove NEOCORE 926 board
sunxi: Cleanup the reset code and add meaningful registers defines
ARM: S3C24XX: header mach/regs-mem.h local
ARM: S3C24XX: header mach/regs-power.h local
ARM: S3C24XX: header mach/regs-s3c2412-mem.h local
ARM: S3C24XX: Remove plat-s3c24xx directory in arch/arm/
ARM: S3C24XX: transform s3c2443 subirqs into new structure
ARM: S3C24XX: modify s3c2443 irq init to initialize all irqs
ARM: S3C24XX: move s3c2443 irq code to irq.c
ARM: S3C24XX: transform s3c2416 irqs into new structure
ARM: S3C24XX: modify s3c2416 irq init to initialize all irqs
ARM: S3C24XX: move s3c2416 irq init to common irq code
ARM: S3C24XX: Modify s3c_irq_wake to use the hwirq property
ARM: S3C24XX: Move irq syscore-ops to irq-pm
clocksource: always define CLOCKSOURCE_OF_DECLARE
...
Here's the big tty/serial driver patches for 3.9-rc1.
More tty port rework and fixes from Jiri here, as well as lots of
individual serial driver updates and fixes.
All of these have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmZYQACgkQMUfUDdst+ylJDgCg0B0nMevUUdM4hLvxunbbiyXM
HUEAoIOedqriNNPvX4Bwy0hjeOEaWx0g
=vi6x
-----END PGP SIGNATURE-----
Merge tag 'tty-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial patches from Greg Kroah-Hartman:
"Here's the big tty/serial driver patches for 3.9-rc1.
More tty port rework and fixes from Jiri here, as well as lots of
individual serial driver updates and fixes.
All of these have been in the linux-next tree for a while."
* tag 'tty-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
tty: mxser: improve error handling in mxser_probe() and mxser_module_init()
serial: imx: fix uninitialized variable warning
serial: tegra: assume CONFIG_OF
TTY: do not update atime/mtime on read/write
lguest: select CONFIG_TTY to build properly.
ARM defconfigs: add missing inclusions of linux/platform_device.h
fb/exynos: include platform_device.h
ARM: sa1100/assabet: include platform_device.h directly
serial: imx: Fix recursive locking bug
pps: Fix build breakage from decoupling pps from tty
tty: Remove ancient hardpps()
pps: Additional cleanups in uart_handle_dcd_change
pps: Move timestamp read into PPS code proper
pps: Don't crash the machine when exiting will do
pps: Fix a use-after free bug when unregistering a source.
pps: Use pps_lookup_dev to reduce ldisc coupling
pps: Add pps_lookup_dev() function
tty: serial: uartlite: Support uartlite on big and little endian systems
tty: serial: uartlite: Fix sparse and checkpatch warnings
serial/arc-uart: Miscll DT related updates (Grant's review comments)
...
Fix up trivial conflicts, mostly just due to the TTY config option
clashing with the EXPERIMENTAL removal.
Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers all
over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
If you need me to provide a merged tree to handle these resolutions,
please let me know.
Other than those patches, there's not much here, some minor fixes and
updates.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
=yWAQ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg Kroah-Hartman:
"Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers
all over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
Other than those patches, there's not much here, some minor fixes and
updates"
Fix up trivial conflicts
* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
base: memory: fix soft/hard_offline_page permissions
drivercore: Fix ordering between deferred_probe and exiting initcalls
backlight: fix class_find_device() arguments
TTY: mark tty_get_device call with the proper const values
driver-core: constify data for class_find_device()
firmware: Ignore abort check when no user-helper is used
firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
firmware: Make user-mode helper optional
firmware: Refactoring for splitting user-mode helper code
Driver core: treat unregistered bus_types as having no devices
watchdog: Convert to devm_ioremap_resource()
thermal: Convert to devm_ioremap_resource()
spi: Convert to devm_ioremap_resource()
power: Convert to devm_ioremap_resource()
mtd: Convert to devm_ioremap_resource()
mmc: Convert to devm_ioremap_resource()
mfd: Convert to devm_ioremap_resource()
media: Convert to devm_ioremap_resource()
iommu: Convert to devm_ioremap_resource()
drm: Convert to devm_ioremap_resource()
...
This reverts commit 351429b2e62b6545bb10c756686393f29ba268a1. The
extra local_irq_save() is not longer needed as the call site now
always calls with interrupts disabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
With the irq protection in irq_exit, we can remove the #ifdeffery and
the bh_disable/enable dance in invoke_softirq()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302202155320.22263@ionos
We had already a few problems with code called from irq_exit() when
interrupted from a nesting interrupt. This can happen on architectures
which do not define __ARCH_IRQ_EXIT_IRQS_DISABLED.
__ARCH_IRQ_EXIT_IRQS_DISABLED should go away and we want to make it
mandatory to call irq_exit() with interrupts disabled.
As a temporary protection disable interrupts for those architectures
which do not define __ARCH_IRQ_EXIT_IRQS_DISABLED and add a WARN_ONCE
when an architecture which defines __ARCH_IRQ_EXIT_IRQS_DISABLED calls
irq_exit() with interrupts enabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302202155320.22263@ionos
As it stands, irq_exit() may or may not be called with
irqs disabled, depending on __ARCH_IRQ_EXIT_IRQS_DISABLED
that the arch can define.
It makes tick_nohz_irq_exit() unsafe. For example two
interrupts can race in tick_nohz_stop_sched_tick(): the inner
most one computes the expiring time on top of the timer list,
then it's interrupted right before reprogramming the
clock. The new interrupt enqueues a new timer list timer,
it reprogram the clock to take it into account and it exits.
The CPUs resumes the inner most interrupt and performs the clock
reprogramming without considering the new timer list timer.
This regression has been introduced by:
280f06774a
("nohz: Separate out irq exit and idle loop dyntick logic")
Let's fix it right now with the appropriate protections.
A saner long term solution will be to remove
__ARCH_IRQ_EXIT_IRQS_DISABLED and mandate that irq_exit() is called
with interrupts disabled.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: <stable@vger.kernel.org> #v3.2+
Link: http://lkml.kernel.org/r/1361373336-11337-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When idr_find() was fed a negative ID, it used to look up the ID
ignoring the sign bit before recent ("idr: remove MAX_IDR_MASK and
move left MAX_IDR_* into idr.c") patch. Now a negative ID triggers
a WARN_ON_ONCE().
__lock_timer() feeds timer_id from userland directly to idr_find()
without sanitizing it which can trigger the above malfunctions. Add a
range check on @timer_id before invoking idr_find() in __lock_timer().
While timer_t is defined as int by all archs at the moment, Andrew
worries that it may be defined as a larger type later on. Make the
test cover larger integers too so that it at least is guaranteed to
not return the wrong timer.
Note that WARN_ON_ONCE() in idr_find() on id < 0 is transitional
precaution while moving away from ignoring MSB. Once it's gone we can
remove the guard as long as timer_t isn't larger than int.
Signed-off-by: Tejun Heo <tj@kernel.org>nnn
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20130220232412.GL3570@htj.dyndns.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull networking update from David Miller:
1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
timestamp offset. From Andrey Vagin.
2) VMWARE VM VSOCK layer, from Andy King.
3) Much improved support for virtual functions and SR-IOV in bnx2x,
from Ariel ELior.
4) All protocols on ipv4 and ipv6 are now network namespace aware, and
all the compatability checks for initial-namespace-only protocols is
removed. Thanks to Tom Parkin for helping deal with the last major
holdout, L2TP.
5) IPV6 support in netpoll and network namespace support in pktgen,
from Cong Wang.
6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
Protocol (MVRP) support, from David Ward.
7) Compute packet lengths more accurately in the packet scheduler, from
Eric Dumazet.
8) Use per-task page fragment allocator in skb_append_datato_frags(),
also from Eric Dumazet.
9) Add support for connection tracking labels in netfilter, from
Florian Westphal.
10) Fix default multicast group joining on ipv6, and add anti-spoofing
checks to 6to4 and 6rd. From Hannes Frederic Sowa.
11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
times, rearrange inet frag datastructures for better cacheline
locality, and move more operations outside of locking. From Jesper
Dangaard Brouer.
12) Instead of strict master <--> slave relationships, allow arbitrary
scenerios with "upper device lists". From Jiri Pirko.
13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
Pirko.
14) Add a BPF filter netfilter match target, from Willem de Bruijn.
15) Orphan and delete a bunch of pre-historic networking drivers from
Paul Gortmaker.
16) Add TSO support for GRE tunnels, from Pravin B SHelar. Although
this still needs some minor bug fixing before it's %100 correct in
all cases.
17) Handle unresolved IPSEC states like ARP, with a resolution packet
queue. From Steffen Klassert.
18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
Hemminger. This was long overdue.
19) Support SO_REUSEPORT, from Tom Herbert.
20) Allow locking a socket BPF filter, so that it cannot change after a
process drops capabilities.
21) Add VLAN filtering to bridge, from Vlad Yasevich.
22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
the ipv6 routes, from YOSHIFUJI Hideaki.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
ipv6: fix race condition regarding dst->expires and dst->from.
net: fix a wrong assignment in skb_split()
ip_gre: remove an extra dst_release()
ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
atl1c: restore buffer state
net: fix a build failure when !CONFIG_PROC_FS
net: ipv4: fix waring -Wunused-variable
net: proc: fix build failed when procfs is not configured
Revert "xen: netback: remove redundant xenvif_put"
net: move procfs code to net/core/net-procfs.c
qmi_wwan, cdc-ether: add ADU960S
bonding: set sysfs device_type to 'bond'
bonding: fix bond_release_all inconsistencies
b44: use netdev_alloc_skb_ip_align()
xen: netback: remove redundant xenvif_put
net: fec: Do a sanity check on the gpio number
ip_gre: propogate target device GSO capability to the tunnel device
ip_gre: allow CSUM capable devices to handle packets
bonding: Fix initialize after use for 3ad machine state spinlock
bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
...
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from
Rafael J. Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng
with contributions from Aaron Lu, Chao Guan, Jesper Juhl, and
Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri
with contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from
Dirk Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King,
Davidlohr Bueso, Joseph Salisbury, Kees Cook, Li Fei,
Nishanth Menon, ShuoX Liu, Srinivas Pandruvada, Tejun Heo,
Thomas Renninger, and Yasuaki Ishimatsu.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJRIsArAAoJEKhOf7ml8uNsD6MP/j7C4NA+GTq6RdwoJt+Yki0K
9Ep8I4pEuRFoN/oskv24EyQhpGJIk6UxWcJ/DWFBc+1VhmKORta7k2Idv/wlJA77
s7AcDveA9xcDh+TVfbh87TeuiMSXiSdDZbiaQO+wMizWJAF3F84AnjiAqqqyQcSK
bA5/Siz/vWlt9PyYDaQtHTVE4lpvPuVcQdYewsdaH2PsmUjvIg/TUzg28CTrdyvv
eHOdBK9R0/OLQLhzRbL0VOGJ//wEl+HJRO0QEhTKPgdQ1e/VH/4Zu5WSzF8P/x4C
s2f8U4IKQqulDuDHXtpMpelFm7hRWgsOqZLkcyXLs+0dvSM9CTPO6P0ZaImxUctk
5daHWEsXUnCErDQawt1mcZP8l6qnxofMQIfLXyPVzvlSnHyToTmrtXa1v2u4AuL/
hOo4MYWsFNUmRdtGFFGlExGgEDZ4G5NwiYjRBl/6XJ3v4nhnnMbuzxP8scpoe5m1
8tjroJHZFUUs/mFU/H+oRbHzSzXPmp1sddNaTg4OpVmTn3DDh6ljnFhiItd1Ndw0
5ldVbSe6ETq5RoK0TbzvQOeVpa9F3JfqbrXLQPqfd2iz/No41LQYG1uShRYuXKuA
wfEcc+c9VMd3FILu05pGwBnU8VS9VbxTYMz7xDxg6b29Ywnb7u+Q1ycCk2gFYtkS
E2oZDuyewTJxaskzYsNr
=wijn
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from Rafael
J Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from Dirk
Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
Ishimatsu.
* tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
PM idle: remove global declaration of pm_idle
unicore32 idle: delete stray pm_idle comment
openrisc idle: delete pm_idle
mn10300 idle: delete pm_idle
microblaze idle: delete pm_idle
m32r idle: delete pm_idle, and other dead idle code
ia64 idle: delete pm_idle
cris idle: delete idle and pm_idle
ARM64 idle: delete pm_idle
ARM idle: delete pm_idle
blackfin idle: delete pm_idle
sparc idle: rename pm_idle to sparc_idle
sh idle: rename global pm_idle to static sh_idle
x86 idle: rename global pm_idle to static x86_idle
APM idle: register apm_cpu_idle via cpuidle
cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
cpufreq / intel_pstate: Change to disallow module build
tools/power turbostat: display SMI count by default
intel_idle: export both C1 and C1E
ACPI / hotplug: Fix concurrency issues and memory leaks
...
Pull cpuset changes from Tejun Heo:
- Synchornization has seen a lot of changes with focus on decoupling
cpuset synchronization from cgroup internal locking.
After this change, there only remain a couple of mostly trivial
dependencies on cgroup_lock outside cgroup core proper. cgroup_lock
is scheduled to be unexported in this devel cycle.
This will finally remove the fragile locking order around cgroup
(cgroup locking wants to / should be one of the outermost but yet has
been acquired from deep inside individual controllers).
- At this point, Li is most knowlegeable with cpuset and taking over
the maintainership of cpuset.
* 'for-3.9-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: drop spurious retval assignment in proc_cpuset_show()
cpuset: fix RCU lockdep splat
cpuset: update MAINTAINERS
cpuset: remove cpuset->parent
cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()
cpuset: replace cgroup_mutex locking with cpuset internal locking
cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty
cpuset: pin down cpus and mems while a task is being attached
cpuset: make CPU / memory hotplug propagation asynchronous
cpuset: drop async_rebuild_sched_domains()
cpuset: don't nest cgroup_mutex inside get_online_cpus()
cpuset: reorganize CPU / memory hotplug handling
cpuset: cleanup cpuset[_can]_attach()
cpuset: introduce cpuset_for_each_child()
cpuset: introduce CS_ONLINE
cpuset: introduce ->css_on/offline()
cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()
cpuset: remove unused cpuset_unlock()
Pull cgroup changes from Tejun Heo:
"Nothing too drastic.
- Removal of synchronize_rcu() from userland visible paths.
- Various fixes and cleanups from Li.
- cgroup_rightmost_descendant() added which will be used by cpuset
changes (it will be a separate pull request)."
* 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fail if monitored file and event_control are in different cgroup
cgroup: fix cgroup_rmdir() vs close(eventfd) race
cpuset: fix cpuset_print_task_mems_allowed() vs rename() race
cgroup: fix exit() vs rmdir() race
cgroup: remove bogus comments in cgroup_diput()
cgroup: remove synchronize_rcu() from cgroup_diput()
cgroup: remove duplicate RCU free on struct cgroup
sched: remove redundant NULL cgroup check in task_group_path()
sched: split out css_online/css_offline from tg creation/destruction
cgroup: initialize cgrp->dentry before css_alloc()
cgroup: remove a NULL check in cgroup_exit()
cgroup: fix bogus kernel warnings when cgroup_create() failed
cgroup: remove synchronize_rcu() from rebind_subsystems()
cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}()
cgroup: use new hashtable implementation
cgroups: fix cgroup_event_listener error handling
cgroups: move cgroup_event_listener.c to tools/cgroup
cgroup: implement cgroup_rightmost_descendant()
cgroup: remove unused dummy cgroup_fork_callbacks()
Pull async changes from Tejun Heo:
"These are followups for the earlier deadlock issue involving async
ending up waiting for itself through block requesting module[1]. The
following changes are made by these commits.
- Instead of requesting default elevator on each request_queue init,
block now requests it once early during boot.
- Kmod triggers warning if invoked from an async worker.
- Async synchronization implementation has been reimplemented. It's
a lot simpler now."
* 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
async: initialise list heads to fix crash
async: replace list of active domains with global list of pending items
async: keep pending tasks on async_domain and remove async_pending
async: use ULLONG_MAX for infinity cookie value
async: bring sanity to the use of words domain and running
async, kmod: warn on synchronous request_module() from async workers
block: don't request module during elevator init
init, block: try to load default elevator module early during boot
Pull workqueue changes from Tejun Heo:
"A lot of reorganization is going on mostly to prepare for worker pools
with custom attributes so that workqueue can replace custom pool
implementations in places including writeback and btrfs and make CPU
assignment in crypto more flexible.
workqueue evolved from purely per-cpu design and implementation, so
there are a lot of assumptions regarding being bound to CPUs and even
unbound workqueues are implemented as an extension of the model -
workqueues running on the special unbound CPU. Bulk of changes this
round are about promoting worker_pools as the top level abstraction
replacing global_cwq (global cpu workqueue). At this point, I'm
fairly confident about getting custom worker pools working pretty soon
and ready for the next merge window.
Lai's patches are replacing the convoluted mb() dancing workqueue has
been doing with much simpler mechanism which only depends on
assignment atomicity of long. For details, please read the commit
message of 0b3dae68ac ("workqueue: simplify is-work-item-queued-here
test"). While the change ends up adding one pointer to struct
delayed_work, the inflation in percentage is less than five percent
and it decouples delayed_work logic a lot more cleaner from usual work
handling, removes the unusual memory barrier dancing, and allows for
further simplification, so I think the trade-off is acceptable.
There will be two more workqueue related pull requests and there are
some shared commits among them. I'll write further pull requests
assuming this pull request is pulled first."
* 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (37 commits)
workqueue: un-GPL function delayed_work_timer_fn()
workqueue: rename cpu_workqueue to pool_workqueue
workqueue: reimplement is_chained_work() using current_wq_worker()
workqueue: fix is_chained_work() regression
workqueue: pick cwq instead of pool in __queue_work()
workqueue: make get_work_pool_id() cheaper
workqueue: move nr_running into worker_pool
workqueue: cosmetic update in try_to_grab_pending()
workqueue: simplify is-work-item-queued-here test
workqueue: make work->data point to pool after try_to_grab_pending()
workqueue: add delayed_work->wq to simplify reentrancy handling
workqueue: make work_busy() test WORK_STRUCT_PENDING first
workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END
workqueue: post global_cwq removal cleanups
workqueue: rename nr_running variables
workqueue: remove global_cwq
workqueue: remove worker_pool->gcwq
workqueue: replace for_each_worker_pool() with for_each_std_worker_pool()
workqueue: make freezing/thawing per-pool
workqueue: make hotplug processing per-pool
...
Pull workqueue [delayed_]work_pending() cleanups from Tejun Heo:
"This is part of on-going cleanups to remove / minimize usages of
workqueue interfaces which are deprecated and/or misleading.
This round drops a number of usages of [delayed_]work_pending(), which
are dangerous as they lack any form of synchronization and thus often
lead to buggy / unnecessary code. There are a couple legitimate use
cases in kernel. Hopefully, they can be converted and
[delayed_]work_pending() can be removed completely. Even if not,
removing most of misuses should make it more difficult to find
examples of misuses and thus slow down growth of them.
These changes are independent from other workqueue changes."
* 'for-3.9-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
wimax/i2400m: fix i2400m->wake_tx_skb handling
kprobes: fix wait_for_kprobe_optimizer()
ipw2x00: simplify scan_event handling
video/exynos: don't use [delayed_]work_pending()
tty/max3100: don't use [delayed_]work_pending()
x86/mce: don't use [delayed_]work_pending()
rfkill: don't use [delayed_]work_pending()
wl1251: don't use [delayed_]work_pending()
thinkpad_acpi: don't use [delayed_]work_pending()
mwifiex: don't use [delayed_]work_pending()
sja1000: don't use [delayed_]work_pending()
Pull two x86 kernel build changes from Ingo Molnar:
"The first change modifies how 'make oldconfig' works on cross-bitness
situations on x86. It was felt the new behavior of preserving the
bitness of the .config is more logical. This is a leftover of the
merge.
The second change eliminates a Perl warning. (There's another, more
complete fix resulting of this warning fix, which second fix in flight
to you via the kbuild tree, which will remove the timeconst.pl script
altogether.)"
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timeconst.pl: Eliminate Perl warning
x86: Default to ARCH=x86 to avoid overriding CONFIG_64BIT
Pull x86/apic changes from Ingo Molnar:
"Main changes:
- Multiple MSI support added to the APIC, PCI and AHCI code - acked
by all relevant maintainers, by Alexander Gordeev.
The advantage is that multiple AHCI ports can have multiple MSI
irqs assigned, and can thus spread to multiple CPUs.
[ Drivers can make use of this new facility via the
pci_enable_msi_block_auto() method ]
- x86 IOAPIC code from interrupt remapping cleanups from Joerg
Roedel:
These patches move all interrupt remapping specific checks out of
the x86 core code and replaces the respective call-sites with
function pointers. As a result the interrupt remapping code is
better abstraced from x86 core interrupt handling code.
- Various smaller improvements, fixes and cleanups."
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/intel/irq_remapping: Clean up x2apic opt-out security warning mess
x86, kvm: Fix intialization warnings in kvm.c
x86, irq: Move irq_remapped out of x86 core code
x86, io_apic: Introduce eoi_ioapic_pin call-back
x86, msi: Introduce x86_msi.compose_msi_msg call-back
x86, irq: Introduce setup_remapped_irq()
x86, irq: Move irq_remapped() check into free_remapped_irq
x86, io-apic: Remove !irq_remapped() check from __target_IO_APIC_irq()
x86, io-apic: Move CONFIG_IRQ_REMAP code out of x86 core
x86, irq: Add data structure to keep AMD specific irq remapping information
x86, irq: Move irq_remapping_enabled declaration to iommu code
x86, io_apic: Remove irq_remapping_enabled check in setup_timer_IRQ0_pin
x86, io_apic: Move irq_remapping_enabled checks out of check_timer()
x86, io_apic: Convert setup_ioapic_entry to function pointer
x86, io_apic: Introduce set_affinity function pointer
x86, msi: Use IRQ remapping specific setup_msi_irqs routine
x86, hpet: Introduce x86_msi_ops.setup_hpet_msi
x86, io_apic: Introduce x86_io_apic_ops.print_entries for debugging
x86, io_apic: Introduce x86_io_apic_ops.disable()
x86, apic: Mask IO-APIC and PIC unconditionally on LAPIC resume
...
Pull timer changes from Ingo Molnar:
"Main changes:
- ntp: Add CONFIG_RTC_SYSTOHC: a generic RTC driver facility
complementing the existing CONFIG_RTC_HCTOSYS, which uses NTP to
keep the hardware clock updated.
- posix-timers: Fix clock_adjtime to always return timex data on
success. This is changing the ABI, but no breakage was expected
and found - caution is warranted nevertheless.
- platform persistent clock improvements/cleanups.
- clockevents: refactor timer broadcast handling to be more generic
and less duplicated with matching architecture code (mostly ARM
motivated.)
- various fixes and cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/x86/hpet: Use HPET_COUNTER to specify the hpet counter in vread_hpet()
posix-cpu-timers: Fix nanosleep task_struct leak
clockevents: Fix generic broadcast for FEAT_C3STOP
time, Fix setting of hardware clock in NTP code
hrtimer: Prevent hrtimer_enqueue_reprogram race
clockevents: Add generic timer broadcast function
clockevents: Add generic timer broadcast receiver
timekeeping: Switch HAS_PERSISTENT_CLOCK to ALWAYS_USE_PERSISTENT_CLOCK
x86/time/rtc: Don't print extended CMOS year when reading RTC
x86: Select HAS_PERSISTENT_CLOCK on x86
timekeeping: Add CONFIG_HAS_PERSISTENT_CLOCK option
rtc: Skip the suspend/resume handling if persistent clock exist
timekeeping: Add persistent_clock_exist flag
posix-timers: Fix clock_adjtime to always return timex data on success
Round the calculated scale factor in set_cyc2ns_scale()
NTP: Add a CONFIG_RTC_SYSTOHC configuration
MAINTAINERS: Update John Stultz's email
time: create __getnstimeofday for WARNless calls
Pull preparatory smp/hotplug patches from Ingo Molnar:
"Some early preparatory changes for the WIP hotplug rework by Thomas
Gleixner."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine: Use smpboot threads
stop_machine: Store task reference in a separate per cpu variable
smpboot: Allow selfparking per cpu threads
Pull scheduler changes from Ingo Molnar:
"Main changes:
- scheduler side full-dynticks (user-space execution is undisturbed
and receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready, from Frederic
Weisbecker.
- Initial sched.h split-up changes, by Clark Williams
- select_idle_sibling() performance improvement by Mike Galbraith:
" 1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs "
- sched_rr_get_interval() ABI fix/change. We think this detail is not
used by apps (so it's not an ABI in practice), but lets keep it
under observation.
- misc RT scheduling cleanups, optimizations"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
cputime: Remove irqsave from seqlock readers
sched, powerpc: Fix sched.h split-up build failure
cputime: Restore CPU_ACCOUNTING config defaults for PPC64
sched/rt: Move rt specific bits into new header file
sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
sched: Move sched.h sysctl bits into separate header
sched: Fix signedness bug in yield_to()
sched: Fix select_idle_sibling() bouncing cow syndrome
sched/rt: Further simplify pick_rt_task()
sched/rt: Do not account zero delta_exec in update_curr_rt()
cputime: Safely read cputime of full dynticks CPUs
kvm: Prepare to add generic guest entry/exit callbacks
cputime: Use accessors to read task cputime stats
cputime: Allow dynamic switch between tick/virtual based cputime accounting
cputime: Generic on-demand virtual cputime accounting
cputime: Move default nsecs_to_cputime() to jiffies based cputime file
cputime: Librarize per nsecs resolution cputime definitions
cputime: Avoid multiplication overflow on utime scaling
context_tracking: Export context state for generic vtime
...
Fix up conflict in kernel/context_tracking.c due to comment additions.
Pull perf changes from Ingo Molnar:
"There are lots of improvements, the biggest changes are:
Main kernel side changes:
- Improve uprobes performance by adding 'pre-filtering' support, by
Oleg Nesterov.
- Make some POWER7 events available in sysfs, equivalent to what was
done on x86, from Sukadev Bhattiprolu.
- tracing updates by Steve Rostedt - mostly misc fixes and smaller
improvements.
- Use perf/event tracing to report PCI Express advanced errors, by
Tony Luck.
- Enable northbridge performance counters on AMD family 15h, by Jacob
Shin.
- This tracing commit:
tracing: Remove the extra 4 bytes of padding in events
changes the ABI. All involved parties (PowerTop in particular)
seem to agree that it's safe to do now with the introduction of
libtraceevent, but the devil is in the details ...
Main tooling side changes:
- Add 'event group view', from Namyung Kim:
To use it, 'perf record' should group events when recording. And
then perf report parses the saved group relation from file header
and prints them together if --group option is provided. You can
use the 'perf evlist' command to see event group information:
$ perf record -e '{ref-cycles,cycles}' noploop 1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ]
$ perf evlist --group
{ref-cycles,cycles}
With this example, default perf report will show you each event
separately.
You can use --group option to enable event group view:
$ perf report --group
...
# group: {ref-cycles,cycles}
# ========
# Samples: 7K of event 'anon group { ref-cycles, cycles }'
# Event count (approx.): 6876107743
#
# Overhead Command Shared Object Symbol
# ................ ....... ................. ..........................
99.84% 99.76% noploop noploop [.] main
0.07% 0.00% noploop ld-2.15.so [.] strcmp
0.03% 0.00% noploop [kernel.kallsyms] [k] timerqueue_del
0.03% 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu
0.02% 0.00% noploop [kernel.kallsyms] [k] account_user_time
0.01% 0.00% noploop [kernel.kallsyms] [k] __alloc_pages_nodemask
0.00% 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe
0.00% 0.11% noploop [kernel.kallsyms] [k] _raw_spin_lock
0.00% 0.06% noploop [kernel.kallsyms] [k] find_get_page
0.00% 0.02% noploop [kernel.kallsyms] [k] rcu_check_callbacks
0.00% 0.02% noploop [kernel.kallsyms] [k] __current_kernel_time
As you can see the Overhead column now contains both of ref-cycles
and cycles and header line shows group information also - 'anon
group { ref-cycles, cycles }'. The output is sorted by period of
group leader first.
- Initial GTK+ annotate browser, from Namhyung Kim.
- Add option for runtime switching perf data file in perf report,
just press 's' and a menu with the valid files found in the current
directory will be presented, from Feng Tang.
- Add support to display whole group data for raw columns, from Jiri
Olsa.
- Add per processor socket count aggregation in perf stat, from
Stephane Eranian.
- Add interval printing in 'perf stat', from Stephane Eranian.
- 'perf test' improvements
- Add support for wildcards in tracepoint system name, from Jiri
Olsa.
- Add anonymous huge page recognition, from Joshua Zhu.
- perf build-id cache now can show DSOs present in a perf.data file
that are not in the cache, to integrate with build-id servers being
put in place by organizations such as Fedora.
- perf top now shares more of the evsel config/creation routines with
'record', paving the way for further integration like 'top'
snapshots, etc.
- perf top now supports DWARF callchains.
- Fix mmap limitations on 32-bit, fix from David Miller.
- 'perf bench numa mem' NUMA performance measurement suite
- ... and lots of fixes, performance improvements, cleanups and other
improvements I failed to list - see the shortlog and git log for
details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (270 commits)
perf/x86/amd: Enable northbridge performance counters on AMD family 15h
perf/hwbp: Fix cleanup in case of kzalloc failure
perf tools: Fix build with bison 2.3 and older.
perf tools: Limit unwind support to x86 archs
perf annotate: Make it to be able to skip unannotatable symbols
perf gtk/annotate: Fail early if it can't annotate
perf gtk/annotate: Show source lines with gray color
perf gtk/annotate: Support multiple event annotation
perf ui/gtk: Implement basic GTK2 annotation browser
perf annotate: Fix warning message on a missing vmlinux
perf buildid-cache: Add --update option
uprobes/perf: Avoid uprobe_apply() whenever possible
uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
uprobes/perf: Teach trace_uprobe/perf code to pre-filter
uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
uprobes: Introduce uprobe_apply()
perf: Introduce hw_perf_event->tp_target and ->tp_list
uprobes/perf: Always increment trace_uprobe->nhit
uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
uprobes/tracing: Introduce is_trace_uprobe_enabled()
...
Pull irq core changes from Ingo Molnar:
"The biggest changes are the IRQ-work and printk changes from Frederic
Weisbecker, which prepare the code for 'full dynticks' (the ability to
stop or slow down the periodic tick arbitrarily, not just in idle time
as today):
- Don't stop tick with irq works pending. This fix is generally
useful and concerns archs that can't raise self IPIs.
- Flush irq works before CPU offlining.
- Introduce "lazy" irq works that can wait for the next tick to be
executed, unless it's stopped.
- Implement klogd wake up using irq work. This removes the ad-hoc
printk_tick()/printk_needs_cpu() hooks and make it working even in
dynticks mode.
- Cleanups and fixes."
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Export enable/disable_percpu_irq()
arch Kconfig: Remove references to IRQ_PER_CPU
irq_work: Remove return value from the irq_work_queue() function
genirq: Avoid deadlock in spurious handling
printk: Wake up klogd using irq_work
irq_work: Make self-IPIs optable
irq_work: Warn if there's still work on cpu_down
irq_work: Flush work on CPU_DYING
irq_work: Don't stop the tick with pending works
nohz: Add API to check tick state
irq_work: Remove CONFIG_HAVE_IRQ_WORK
irq_work: Fix racy check on work pending flag
irq_work: Fix racy IRQ_WORK_BUSY flag setting
Pull RCU changes from Ingo Molnar:
"SRCU changes:
- These include debugging aids, updates that move towards the goal of
permitting srcu_read_lock() and srcu_read_unlock() to be used from
idle and offline CPUs, and a few small fixes.
Changes to rcutorture and to RCU documentation:
- Posted to LKML at https://lkml.org/lkml/2013/1/26/188
Enhancements to uniprocessor handling in tiny RCU:
- Posted to LKML at https://lkml.org/lkml/2013/1/27/2
Tag RCU callbacks with grace-period number to simplify callback
advancement:
- Posted to LKML at https://lkml.org/lkml/2013/1/26/203
Miscellaneous fixes:
- Posted to LKML at https://lkml.org/lkml/2013/1/26/204"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock()
srcu: Update synchronize_srcu_expedited()'s comments
srcu: Update synchronize_srcu()'s comments
srcu: Remove checks preventing idle CPUs from calling srcu_read_lock()
srcu: Remove checks preventing offline CPUs from calling srcu_read_lock()
srcu: Simple cleanup for cleanup_srcu_struct()
srcu: Add might_sleep() annotation to synchronize_srcu()
srcu: Simplify __srcu_read_unlock() via this_cpu_dec()
rcu: Allow rcutorture to be built at low optimization levels
rcu: Make rcutorture's shuffler task shuffle recently added tasks
rcu: Allow TREE_PREEMPT_RCU on UP systems
rcu: Provide RCU CPU stall warnings for tiny RCU
context_tracking: Add comments on interface and internals
rcu: Remove obsolete Kconfig option from comment
rcu: Remove unused code originally used for context tracking
rcu: Consolidate debugging Kconfig options
rcu: Correct 'optimized' to 'optimize' in header comment
rcu: Trace callback acceleration
rcu: Tag callback lists with corresponding grace-period number
rcutorture: Don't compare ptr with 0
...
KMS drivers can potentially restore the display configuration without
userspace help. Such drivers can can call a new funciton,
pm_vt_switch_required(false) if they support this feature. In that
case, the PM layer won't VT switch to the suspend console at suspend
time and then back to the original VT on resume, but rather leave things
alone for a nicer looking suspend and resume sequence.
v2: make a function so we can handle multiple drivers (Alan)
v3: use a list to track device requests (Rafael)
v4: Squash in build fix from Jesse for CONFIG_VT_CONSOLE_SLEEP=n
v5: Squash in patch from Wu Fengguang to add a few missing static
qualifiers.
v6: Add missing EXPORT_SYMBOL.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (v3)
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
commit d8e794dfd5 ("workqueue: set
delayed_work->timer function on initialization") exports function
delayed_work_timer_fn() only for GPL modules. This makes delayed-works
unusable for non-GPL modules, because initialization macro now requires
GPL symbol. For example schedule_delayed_work() available for non-GPL.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # 3.7
seconds_overflow() is called from hard interrupt context even on
Preempt-RT. This requires the lock to be a raw_spinlock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The get_timestamp() function is always called with current cpu,
thus using local_clock() would be more appropriate and it makes
the code shorter and cleaner IMHO.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1356576585-28782-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The reader side code has no requirement to disable interrupts while
sampling data. The sequence counter is enough to ensure consistency.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull in 'net' to take in the bug fixes that didn't make it into
3.8-final.
Also, deal with the semantic conflict of the change made to
net/ipv6/xfrm6_policy.c A missing rt6->n neighbour release
was added to 'net', but in 'net-next' we no longer cache the
neighbour entries in the ipv6 routes so that change is not
appropriate there.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit: c1bf08ac "ftrace: Be first to run code modification on modules"
changed ftrace module notifier's priority to INT_MAX in order to
process the ftrace nops before anything else could touch them
(namely kprobes). This was the correct thing to do.
Unfortunately, the ftrace module notifier also contains the ftrace
clean up code. As opposed to the set up code, this code should be
run *after* all the module notifiers have run in case a module is doing
correct clean-up and unregisters its ftrace hooks. Basically, ftrace
needs to do clean up on module removal, as it needs to know about code
being removed so that it doesn't try to modify that code. But after it
removes the module from its records, if a ftrace user tries to remove
a probe, that removal will fail due as the record of that code segment
no longer exists.
Nothing really bad happens if the probe removal is called after ftrace
did the clean up, but the ftrace removal function will return an error.
Correct code (such as kprobes) will produce a WARN_ON() if it fails
to remove the probe. As people get annoyed by frivolous warnings, it's
best to do the ftrace clean up after everything else.
By splitting the ftrace_module_notifier into two notifiers, one that
does the module load setup that is run at high priority, and the other
that is called for module clean up that is run at low priority, the
problem is solved.
Cc: stable@vger.kernel.org
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we pass fd of memory.usage_in_bytes of cgroup A to cgroup.event_control
of cgroup B, then we won't get memory usage notification from A but B!
What's worse, if A and B are in different mount hierarchy, we'll end up
accessing NULL pointer!
Disallow this kind of invalid usage.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Tejun Heo <tj@kernel.org>
commit 205a872bd6 ("cgroup: fix lockdep
warning for event_control") solved a deadlock by introducing a new
bug.
Move cgrp->event_list to a temporary list doesn't mean you can traverse
this list locklessly, because at the same time cgroup_event_wake() can
be called and remove the event from the list. The result of this race
is disastrous.
We adopt the way how kvm irqfd code implements race-free event removal,
which is now described in the comments in cgroup_event_wake().
v3:
- call eventfd_signal() no matter it's eventfd close or cgroup removal
that removes the cgroup event.
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
rename() will change dentry->d_name. The result of this race can
be worse than seeing partially rewritten name, but we might access
a stale pointer because rename() will re-allocate memory to hold
a longer name.
It's safe in the protection of dentry->d_lock.
v2: check NULL dentry before acquiring dentry lock.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
In cgroup_exit() put_css_set_taskexit() is called without any lock,
which might lead to accessing a freed cgroup:
thread1 thread2
---------------------------------------------
exit()
cgroup_exit()
put_css_set_taskexit()
atomic_dec(cgrp->count);
rmdir();
/* not safe !! */
check_for_release(cgrp);
rcu_read_lock() can be used to make sure the cgroup is alive.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
commit 07354eb1a7 ("locking printk: Annotate logbuf_lock as raw")
reintroduced a lock inversion problem which was fixed in commit
0b5e1c5255 ("printk: Release console_sem after logbuf_lock"). This
happened probably when fixing up patch rejects.
Restore the ordering and unlock logbuf_lock before releasing
console_sem.
Signed-off-by: ybu <ybu@qti.qualcomm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/E807E903FE6CBE4D95E420FBFCC273B827413C@nasanexd01h.na.qualcomm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
bc is the standard tool for multi-precision arithmetic. We switched
to Perl because akpm reported a hard-to-reproduce build hang, which
was very odd because affected and unaffected machines were all running
the same version of GNU bc.
Unfortunately switching to Perl required a really ugly "canning"
mechanism to support Perl < 5.8 installations lacking the Math::BigInt
module.
It was recently pointed out to me that some very old versions of GNU
make had problems with pipes in subshells, which was indeed the
construct used in the Makefile rules in that version of the patch;
Perl didn't need it so switching to Perl fixed the problem for
unrelated reasons. With the problem (hopefully) root-caused, we can
switch back to bc and do the arbitrary-precision arithmetic naturally.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Michal Marek <mmarek@suse.cz>
PARISC defines /proc/sys/kernel/unaligned-trap to runtime toggle
unaligned access emulation.
The exact mechanics of enablig/disabling are still arch specific, we can
make the sysctl usable by other arches.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
The only difference between wait_for_completion[_timeout]() and
wait_for_completion_io[_timeout]() is that the latter calls
io_schedule_timeout() instead of schedule_timeout() so that the caller
is accounted as waiting for IO, not just sleeping.
These functions can be used for correct iowait time accounting when the
completion struct is actually used for waiting for IO (e.g. completion
of a bio request in the block layer).
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The trinity fuzzer triggered a task_struct reference leak via
clock_nanosleep with CPU_TIMERs. do_cpu_nanosleep() calls
posic_cpu_timer_create(), but misses a corresponding
posix_cpu_timer_del() which leads to the task_struct reference leak.
Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20130215100810.GF4392@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Obviously this is a typo and could result in memory leaks if kzalloc
fails on a given cpu.
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360186160-7566-1-git-send-email-dbaluta@ixiacom.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Use the smpboot thread infrastructure. Mark the stopper thread
selfparking and park it after it has finished the take_cpu_down()
work.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Weinberger <rw@linutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130131120741.686315164@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To allow the stopper thread being managed by the smpboot thread
infrastructure separate out the task storage from the stopper data
structure.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Weinberger <rw@linutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130131120741.626690384@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The stop machine threads are still killed when a cpu goes offline. The
reason is that the thread is used to bring the cpu down, so it can't
be parked along with the other per cpu threads.
Allow a per cpu thread to be excluded from automatic parking, so it
can park itself once it's done
Add a create callback function as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Weinberger <rw@linutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130131120741.553993267@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
__ARCH_WANT_SYS_RT_SIGACTION,
__ARCH_WANT_SYS_RT_SIGSUSPEND,
__ARCH_WANT_COMPAT_SYS_RT_SIGSUSPEND,
__ARCH_WANT_COMPAT_SYS_SCHED_RR_GET_INTERVAL - not used anymore
CONFIG_GENERIC_{SIGALTSTACK,COMPAT_RT_SIG{ACTION,QUEUEINFO,PENDING,PROCMASK}} -
can be assumed always set.
workqueue has moved away from global_cwqs to worker_pools and with the
scheduled custom worker pools, wforkqueues will be associated with
pools which don't have anything to do with CPUs. The workqueue code
went through significant amount of changes recently and mass renaming
isn't likely to hurt much additionally. Let's replace 'cpu' with
'pool' so that it reflects the current design.
* s/struct cpu_workqueue_struct/struct pool_workqueue/
* s/cpu_wq/pool_wq/
* s/cwq/pwq/
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
is_chained_work() was added before current_wq_worker() and implemented
its own ham-fisted way of finding out whether %current is a workqueue
worker - it iterates through all possible workers.
Drop the custom implementation and reimplement using
current_wq_worker().
Signed-off-by: Tejun Heo <tj@kernel.org>
c9e7cf273f ("workqueue: move busy_hash from global_cwq to
worker_pool") incorrectly converted is_chained_work() to use
get_gcwq() inside for_each_gcwq_cpu() while removing get_gcwq().
As cwq might not exist for all possible workqueue CPUs, @cwq can be
NULL and the following cwq deferences can lead to oops.
Fix it by using for_each_cwq_cpu() instead, which is the better one to
use anyway as we only need to check pools that the wq is associated
with.
Signed-off-by: Tejun Heo <tj@kernel.org>
The tracing of ia32 compat system calls has been a bit of a pain as they
use different system call numbers than the 64bit equivalents.
I wrote a simple 'lls' program that lists files. I compiled it as a i686
ELF binary and ran it under a x86_64 box. This is the result:
echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/syscalls/enable
echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on
grep lls /debug/tracing/trace
[.. skipping calls before TS_COMPAT is set ...]
lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420)
lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000
lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000
lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4)
lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe
lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000)
lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3
lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060)
lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0
lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2)
lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000
lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614)
lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0
lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000)
lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3
lls-1127 [005] d... 936.409279: sys_close(fd: 3)
lls-1127 [005] d... 936.421550: sys_close -> 0x200
lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0)
lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0
lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802)
lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000
lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000)
lls-1127 [005] d... 936.421580: sys_capget -> 0x0
lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812)
lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000
lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32)
lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000
lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0)
lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0
lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are
obviously incorrect, and confusing.
Other efforts have been made to fix this:
https://lkml.org/lkml/2012/3/26/367
But the real solution is to rewrite the syscall internals and come up
with a fixed solution. One that doesn't require all the kluge that the
current solution has.
Thus for now, instead of outputting incorrect data, simply ignore them.
With this patch the changes now have:
#> grep lls /debug/tracing/trace
#>
Compat system calls simply are not traced. If users need compat
syscalls, then they should just use the raw syscall tracepoints.
For an architecture to make their compat syscalls ignored, it must
define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also
define an arch_trace_is_compat_syscall() function that will return true
if the current task should ignore tracing the syscall.
I want to stress that this change does not affect actual syscalls in any
way, shape or form. It is only used within the tracing system and
doesn't interfere with the syscall logic at all. The changes are
consolidated nicely into trace_syscalls.c and asm/ftrace.h.
I had to make one small modification to asm/thread_info.h and that was
to remove the include of asm/ftrace.h. As asm/ftrace.h required the
current_thread_info() it was causing include hell. That include was
added back in 2008 when the function graph tracer was added:
commit caf4b323 "tracing, x86: add low level support for ftrace return tracing"
It does not need to be included there.
Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We're forgetting to reenable local interrupts on an error path.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Josh Boyer <jwboyer@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 12ad100046: "clockevents: Add generic timer broadcast function"
made tick_device_uses_broadcast set up the generic broadcast function
for dummy devices (where !tick_device_is_functional(dev)), but neglected
to set up the broadcast function for devices that stop in low power
states (with the CLOCK_EVT_FEAT_C3STOP flag).
When these devices enter low power states they will not have the generic
broadcast function assigned, and will bring down the system when an
attempt is made to broadcast to them.
This patch ensures that the broadcast function is also assigned for
devices which require broadcast in low power states.
Reported-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: nico@linaro.org
Cc: Marc.Zyngier@arm.com
Cc: Will.Deacon@arm.com
Cc: santosh.shilimkar@ti.com
Cc: john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
At present, the value of timeout for freezing is 20s, which is
meaningless in case that one thread is frozen with mutex locked
and another thread is trying to lock the mutex, as this time of
freezing will fail unavoidably.
And if there is no new wakeup event registered, the system will
waste at most 20s for such meaningless trying of freezing.
With this patch, the value of timeout can be configured to smaller
value, so such meaningless trying of freezing will be aborted in
earlier time, and later freezing can be also triggered in earlier
time. And more power will be saved.
In normal case on mobile phone, it costs real little time to freeze
processes. On some platform, it only costs about 20ms to freeze
user space processes and 10ms to freeze kernel freezable threads.
Signed-off-by: Liu Chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Li Fei <fei.li@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
PM_SUSPEND_FREEZE state is a general state that
does not need any platform specific support, it equals
frozen processes + suspended devices + idle processors.
Compared with PM_SUSPEND_MEMORY,
PM_SUSPEND_FREEZE saves less power
because the system is still in a running state.
PM_SUSPEND_FREEZE has less resume latency because it does not
touch BIOS, and the processors are in idle state.
Compared with RTPM/idle,
PM_SUSPEND_FREEZE saves more power as
1. the processor has longer sleep time because processes are frozen.
The deeper c-state the processor supports, more power saving we can get.
2. PM_SUSPEND_FREEZE uses system suspend code path, thus we can get
more power saving from the devices that does not have good RTPM support.
This state is useful for
1) platforms that do not have STR, or have a broken STR.
2) platforms that have an extremely low power idle state,
which can be used to replace STR.
The following describes how PM_SUSPEND_FREEZE state works.
1. echo freeze > /sys/power/state
2. the processes are frozen.
3. all the devices are suspended.
4. all the processors are blocked by a wait queue
5. all the processors idles and enters (Deep) c-state.
6. an interrupt fires.
7. a processor is woken up and handles the irq.
8. if it is a general event,
a) the irq handler runs and quites.
b) goto step 4.
9. if it is a real wake event, say, power button pressing, keyboard touch, mouse moving,
a) the irq handler runs and activate the wakeup source
b) wakeup_source_activate() notifies the wait queue.
c) system starts resuming from PM_SUSPEND_FREEZE
10. all the devices are resumed.
11. all the processes are unfrozen.
12. system is back to working.
Known Issue:
The wakeup of this new PM_SUSPEND_FREEZE state may behave differently
from the previous suspend state.
Take ACPI platform for example, there are some GPEs that only enabled
when the system is in sleep state, to wake the system backk from S3/S4.
But we are not touching these GPEs during transition to PM_SUSPEND_FREEZE.
This means we may lose some wake event.
But on the other hand, as we do not disable all the Interrupts during
PM_SUSPEND_FREEZE, we may get some extra "wakeup" Interrupts, that are
not available for S3/S4.
The patches has been tested on an old Sony laptop, and here are the results:
Average Power:
1. RPTM/idle for half an hour:
14.8W, 12.6W, 14.1W, 12.5W, 14.4W, 13.2W, 12.9W
2. Freeze for half an hour:
11W, 10.4W, 9.4W, 11.3W 10.5W
3. RTPM/idle for three hours:
11.6W
4. Freeze for three hours:
10W
5. Suspend to Memory:
0.5~0.9W
Average Resume Latency:
1. RTPM/idle with a black screen: (From pressing keyboard to screen back)
Less than 0.2s
2. Freeze: (From pressing power button to screen back)
2.50s
3. Suspend to Memory: (From pressing power button to screen back)
4.33s
>From the results, we can see that all the platforms should benefit from
this patch, even if it does not have Low Power S0.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
wait_for_kprobe_optimizer() seems largely broken. It uses
optimizer_comp which is never re-initialized, so
wait_for_kprobe_optimizer() will never wait for anything once
kprobe_optimizer() finishes all pending jobs for the first time.
Also, aside from completion, delayed_work_pending() is %false once
kprobe_optimizer() starts execution and wait_for_kprobe_optimizer()
won't wait for it.
Reimplement it so that it flushes optimizing_work until
[un]optimizing_lists are empty. Note that this also makes
optimizing_work execute immediately if someone's waiting for it, which
is the nicer behavior.
Only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
At init time, if the system time is "warped" forward in warp_clock()
it will differ from the hardware clock by sys_tz.tz_minuteswest. This time
difference is not taken into account when ntp updates the hardware clock,
and this causes the system time to jump forward by this offset every reboot.
The kernel must take this offset into account when writing the system time
to the hardware clock in the ntp code. This patch adds
persistent_clock_is_local which indicates that an offset has been applied
in warp_clock() and accounts for the "warp" before writing the hardware
clock.
x86 does not have this problem as rtc writes are software limited to a
+/-15 minute window relative to the current rtc time. Other arches, such
as powerpc, however do a full synchronization of the system time to the
rtc and will see this problem.
[v2]: generated against tip/timers/core
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Synchronize with 'net' in order to sort out some l2tp, wireless, and
ipv6 GRE fixes that will be built on top of in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
uprobe_perf_open/close call the costly uprobe_apply() every time,
we can avoid it if:
- "nr_systemwide != 0" is not changed.
- There is another process/thread with the same ->mm.
- copy_proccess() does inherit_event(). dup_mmap() preserves the
inserted breakpoints.
- event->attr.enable_on_exec == T, we can rely on uprobe_mmap()
called by exec/mmap paths.
- tp_target is exiting. Only _close() checks PF_EXITING, I don't
think TRACE_REG_PERF_OPEN can hit the dying task too often.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Change uprobe_trace_func() and uprobe_perf_func() to return "int". Change
uprobe_dispatcher() to return "trace_ret | perf_ret" although this is not
needed, currently TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive.
The only functional change is that uprobe_perf_func() checks the filtering
too and returns UPROBE_HANDLER_REMOVE if nobody wants to trace current.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perf record -e probe_libc:syscall -i perl -e 'fork; syscall -1 for 1..10; wait'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 20
A child process doesn't have a counter, but still it hits this breakoint
"copied" by dup_mmap().
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 11
The child process hits this int3 only once and does unapply_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Finally implement uprobe_perf_filter() which checks ->nr_systemwide or
->perf_events to figure out whether we need to insert the breakpoint.
uprobe_perf_open/close are changed to do uprobe_apply(true/false) when
the new perf event comes or goes away.
Note that currently this is very suboptimal:
- uprobe_register() called by TRACE_REG_PERF_REGISTER becomes a
heavy nop, consumer->filter() always returns F at this stage.
As it was already discussed we need uprobe_register_only() to
avoid the costly register_for_each_vma() when possible.
- uprobe_apply() is oftenly overkill. Unless "nr_systemwide != 0"
changes we need uprobe_apply_mm(), unapply_uprobe() is almost
what we need.
- uprobe_apply() can be simply avoided sometimes, see the next
changes.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perl -e 'syscall -1 while 1' &
[1] 530
# perf record -e probe_libc:syscall perl -e 'syscall -1 for 1..10; sleep 1'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 79291
A huge ->nrhit == 79291 reflects the fact that the background process
530 constantly hits this breakpoint too, even if doesn't contribute to
the output.
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 10
This shows that only the target process was punished by int3.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Introduce "struct trace_uprobe_filter" which records the "active"
perf_event's attached to ftrace_event_call. For the start we simply
use list_head, we can optimize this later if needed. For example, we
do not really need to record an event with ->parent != NULL, we can
rely on parent->child_list. And we can certainly do some optimizations
for the case when 2 events have the same ->tp_target or tp_target->mm.
Change trace_uprobe_register() to process TRACE_REG_PERF_OPEN/CLOSE
and add/del this perf_event to the list.
We can probably avoid any locking, but lets start with the "obvioulsy
correct" trace_uprobe_filter->rwlock which protects everything.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Currently it is not possible to change the filtering constraints after
uprobe_register(), so a consumer can not, say, start to trace a task/mm
which was previously filtered out, or remove the no longer needed bp's.
Introduce uprobe_apply() which simply does register_for_each_vma() again
to consult uprobe_consumer->filter() and install/remove the breakpoints.
The only complication is that register_for_each_vma() can no longer
assume that uprobe->consumers should be consulter if is_register == T,
so we change it to accept "struct uprobe_consumer *new" instead.
Unlike uprobe_register(), uprobe_apply(true) doesn't do "unregister" if
register_for_each_vma() fails, it is up to caller to handle the error.
Note: we probably need to cleanup the current interface, it is strange
that uprobe_apply/unregister need inode/offset. We should either change
uprobe_register() to return "struct uprobe *", or add a private ->uprobe
member in uprobe_consumer. And in the long term uprobe_apply() should
take a single argument, uprobe or consumer, even "bool add" should go
away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
sys_perf_event_open()->perf_init_event(event) is called before
find_get_context(event), this means that event->ctx == NULL when
class->reg(TRACE_REG_PERF_REGISTER/OPEN) is called and thus it
can't know if this event is per-task or system-wide.
This patch adds hw_perf_event->tp_target for PERF_TYPE_TRACEPOINT,
this is analogous to PERF_TYPE_BREAKPOINT/bp_target we already have.
The patch also moves ->bp_target up so that it can overlap with the
new member, this can help the compiler to generate the better code.
trace_uprobe_register() will use it for prefiltering to avoid the
unnecessary breakpoints in mm's we do not want to trace.
->tp_target doesn't have its own reference, but we can rely on the
fact that either sys_perf_event_open() holds a reference, or it is
equal to event->ctx->task. So this pointer is always valid until
free_event().
Also add the "struct list_head tp_list" into this union. It is not
strictly necessary, but it can simplify the next changes and we can
add it for free.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Move tu->nhit++ from uprobe_trace_func() to uprobe_dispatcher().
->nhit counts how many time we hit the breakpoint inserted by this
uprobe, we do not want to loose this info if uprobe was enabled by
sys_perf_event_open().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
trace_uprobe->consumer and "struct uprobe_trace_consumer" add the
unnecessary indirection and complicate the code for no reason.
This patch simply embeds uprobe_consumer into "struct trace_uprobe",
all other changes only fix the compilation errors.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
probe_event_enable/disable() check tu->consumer != NULL to avoid the
wrong uprobe_register/unregister().
We are going to kill this pointer and "struct uprobe_trace_consumer",
so we add the new helper, is_trace_uprobe_enabled(), which can rely
on TP_FLAG_TRACE/TP_FLAG_PROFILE instead.
Note: the current logic doesn't look optimal, it is not clear why
TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive, we will probably
change this later.
Also kill the unused TP_FLAG_UPROBE.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
probe_event_enable/disable() check tu->inode != NULL at the start.
This is ugly, if igrab() can fail create_trace_uprobe() should not
succeed and "postpone" the failure.
And S_ISREG(inode->i_mode) check added by d24d7dbf is not safe.
Note: alloc_uprobe() should probably check igrab() != NULL as well.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
probe_event_enable() does uprobe_register() and only after that sets
utc->tu and tu->consumer/flags. This can race with uprobe_dispatcher()
which can miss these assignments or see them out of order. Nothing
really bad can happen, but this doesn't look clean/safe.
And this does not allow to use uprobe_consumer->filter() we are going
to add, it is called by uprobe_register() and it needs utc->tu.
Change this code to initialize everything before uprobe_register(), and
reset tu->consumer/flags if it fails. We can't race with event_disable(),
the caller holds event_mutex, and if we could the code would be wrong
anyway.
In fact I think uprobe_trace_consumer should die, it buys nothing but
complicates the code. We can simply add uprobe_consumer into trace_uprobe.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
create_trace_uprobe() does kern_path() to find ->d_inode, but forgets
to do path_put(). We can do this right after igrab().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The original pull message for uprobes (commit 654443e2) noted:
This tree includes uprobes support in 'perf probe' - but SystemTap
(and other tools) can take advantage of user probe points as well.
In order to actually be usable in module-based tools like SystemTap, the
interface needs to be exported. This patch first adds the obvious
exports for uprobe_register and uprobe_unregister. Then it also adds
one for task_user_regset_view, which is necessary to get the correct
state of userspace registers.
Signed-off-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
utask->xol_vaddr is either zero or valid, remove the bogus
IS_ERR_VALUE() check in xol_free_insn_slot().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
handle_swbp() does get_utask() before can_skip_sstep() for no reason,
we do not need ->utask if can_skip_sstep() succeeds.
Move get_utask() to pre_ssout() who actually starts to use it. Move
the initialization of utask->active_uprobe/state as well. This way
the whole initialization is consolidated in pre_ssout().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
pre_ssout() should do xol_free_insn_slot() if arch_uprobe_pre_xol()
fails, otherwise nobody will free the allocated slot.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
pre_ssout()->xol_get_insn_slot() path is confusing and buggy. This patch
cleanups the code, the next one fixes the bug.
Change xol_get_insn_slot() to only allocate the slot and do nothing more,
move the initialization of utask->xol_vaddr/vaddr into pre_ssout().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Rename add_utask() into get_utask() and change it to allocate on
demand to simplify the caller. Like get_xol_area() it will have
more users.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Currently only xol_get_insn_slot() does get_xol_area() + xol_alloc_area(),
but this will have more users and we do not want to copy-and-paste this
code. This patch simply moves xol_alloc_area() into get_xol_area() to
simplify the current and future code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Move alloc_page() from xol_add_vma() to xol_alloc_area() to cleanup
the code. This separates the memory allocations and consolidates the
-EALREADY cleanups and the error handling.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Change handle_swbp() to set regs->ip = bp_vaddr in advance, this is
what consumer->handler() needs but uprobe_get_swbp_addr() is not
exported.
This also simplifies the code and makes it more consistent across
the supported architectures. handle_swbp() becomes the only caller
of uprobe_get_swbp_addr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Currrently the are 2 problems with pre-filtering:
1. It is not possible to add/remove a task (mm) after uprobe_register()
2. A forked child inherits all breakpoints and uprobe_consumer can not
control this.
This patch does the first step to improve the filtering. handler_chain()
removes the breakpoints installed by this uprobe from current->mm if all
handlers return UPROBE_HANDLER_REMOVE.
Note that handler_chain() relies on ->register_rwsem to avoid the race
with uprobe_register/unregister which can add/del a consumer, or even
remove and then insert the new uprobe at the same address.
Perhaps we will add uprobe_apply_mm(uprobe, mm, is_register) and teach
copy_mm() to do filter(UPROBE_FILTER_FORK), but I think this change makes
sense anyway.
Note: instead of checking the retcode from uc->handler, we could add
uc->filter(UPROBE_FILTER_BPHIT). But I think this is not optimal to
call 2 hooks in a row. This buys nothing, and if handler/filter do
something nontrivial they will probably do the same work twice.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Finally add uprobe_consumer->filter() and change consumer_filter()
to actually call this method.
Note that ->filter() accepts mm_struct, not task_struct. Because:
1. We do not have for_each_mm_user(mm, task).
2. Even if we implement for_each_mm_user(), ->filter() can
use it itself.
3. It is not clear who will actually need this interface to
do the "nontrivial" filtering.
Another argument is "enum uprobe_filter_ctx", consumer->filter() can
use it to figure out why/where it was called. For example, perhaps
we can add UPROBE_FILTER_PRE_REGISTER used by build_map_info() to
quickly "nack" the unwanted mm's. In this case consumer should know
that it is called under ->i_mmap_mutex.
See the previous discussion at http://marc.info/?t=135214229700002
Perhaps we should pass more arguments, vma/vaddr?
Note: this patch obviously can't help to filter out the child created
by fork(), this will be addressed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
filter_chain() was added into install_breakpoint/remove_breakpoint to
simplify the initial changes but this is sub-optimal.
This patch shifts the callsite to the callers, register_for_each_vma()
and uprobe_mmap(). This way:
- It will be easier to add the new arguments. This is the main reason,
we can do more optimizations later.
- register_for_each_vma(is_register => true) can be optimized, we only
need to consult the new consumer. The previous consumers were already
asked when they called uprobe_register().
This patch also moves the MMF_HAS_UPROBES check from remove_breakpoint(),
this allows to avoid the potentionally costly filter_chain(). Note that
register_for_each_vma(is_register => false) doesn't really need to take
->consumer_rwsem, but I don't think it makes sense to optimize this and
introduce filter_chain_lockless().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobe_register() and uprobe_unregister() are the only users of
mutex_lock(uprobes_hash(inode)), and the only reason why we can't
simply remove it is that we need to ensure that delete_uprobe() is
not possible after alloc_uprobe() and before consumer_add().
IOW, we need to ensure that when we take uprobe->register_rwsem
this uprobe is still valid and we didn't race with _unregister()
which called delete_uprobe() in between.
With this patch uprobe_register() simply checks uprobe_is_active()
and retries if it hits this very unlikely race. uprobes_mutex[] is
no longer needed and can be removed.
There is another reason for this change, prepare_uprobe() should be
folded into alloc_uprobe() and we do not want to hold the extra locks
around read_mapping_page/etc.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The lifetime of uprobe->rb_node and uprobe->inode is not refcounted,
delete_uprobe() is called when we detect that uprobe has no consumers,
and it would be deadly wrong to do this twice.
Change delete_uprobe() to WARN() if it was already called. We use
RB_CLEAR_NODE() to mark uprobe "inactive", then RB_EMPTY_NODE() can
be used to detect this case.
RB_EMPTY_NODE() is not used directly, we add the trivial helper for
the next change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobe_events counts the number of uprobes in uprobes_tree but
it is used as a boolean. We can use RB_EMPTY_ROOT() instead.
Probably no_uprobe_events() added by this patch can have more
callers, say, mmf_recalc_uprobes().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that ->register_rwsem is safe under ->mmap_sem we can kill
->copy_mutex and abuse down_write(&uprobe->consumer_rwsem).
This makes prepare_uprobe() even more ugly, but we should kill
it anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Simply remove UPROBE_RUN_HANDLER and the corresponding code.
It can only help if uprobe has a single consumer, and in fact
it is no longer needed after handler_chain() was changed to use
->register_rwsem, we simply can not race with uprobe_register().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that it safe to use ->consumer_rwsem under ->mmap_sem we can
almost finish the implementation of filter_chain(). It still lacks
the actual uc->filter(...) call but othewrwise it is ready, just
it pretends that ->filter() always returns true.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Introduce uprobe->register_rwsem. It is taken for writing around
__uprobe_register/unregister.
Change handler_chain() to use this sem rather than consumer_rwsem.
The main reason for this change is that we have the nasty problem
with mmap_sem/consumer_rwsem dependency. filter_chain() needs to
protect uprobe->consumers like handler_chain(), but they can not
use the same lock. filter_chain() can be called under ->mmap_sem
(currently this is always true), but we want to allow ->handler()
to play with the probed task's memory, and this needs ->mmap_sem.
Alternatively we could use srcu, but synchronize_srcu() is very
slow and ->register_rwsem allows us to do more. In particular, we
can teach handler_chain() to do remove_breakpoint() if this bp is
"nacked" by all consumers, we know that we can't race with the
new consumer which does uprobe_register().
See also the next patches. uprobes_mutex[] is almost ready to die.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To support the filtering uprobe_register() should do
register_for_each_vma(true) every time the new consumer comes,
we need to install the previously nacked breakpoints.
Note:
- uprobes_mutex[] should die, what it actually protects is
alloc_uprobe().
- UPROBE_RUN_HANDLER should die too, obviously it can't work
unless uprobe has a single consumer. The consumer should
serialize with _register/_unregister itself. Or this flag
should live in uprobe_consumer->state.
- Perhaps we can do some optimizations later. For example, if
filter_chain() never returns false uprobe can record this
fact and avoid the unnecessary register_for_each_vma().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobe_unregister() removes the breakpoints only if the last consumer
goes away. To support the filtering it should do this every time, we
want to remove the breakpoints which nobody else want to keep.
Note: given that filter_chain() is not actually implemented, this patch
itself doesn't change the behaviour yet, register_for_each_vma(false)
is a heavy "nop" unless there are no more consumers.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the new helper filter_chain(). Currently it is only placeholder,
the comment explains what is should do. We will change it later to
consult every consumer to decide whether we need to install the swbp.
Until then it works as if any consumer returns true, this matches the
current behavior.
Change install_breakpoint() to call filter_chain() instead of checking
uprobe->consumers != NULL. We obviously need this, and this equally
closes the race with _unregister().
Change remove_breakpoint() to call this helper too. Currently this is
pointless because remove_breakpoint() is only called when the last
consumer goes away, but we will change this.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobe_consumer->filter() is pointless in its current form, kill it.
We will add it back, but with the different signature/semantics. Perhaps
we will even re-introduce the callsite in handler_chain(), but not to
just skip uc->handler().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
register/unregister verifies that inode/uc != NULL. For what?
This really looks like "hide the potential problem", the caller
should pass the valid data.
register() also checks uc->next == NULL, probably to prevent the
double-register but the caller can do other stupid/wrong things.
If we do this check, then we should document that uc->next should
be cleared before register() and add BUG_ON().
Also add the small comment about the i_size_read() check.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cosmetic. __set_bit(UPROBE_SKIP_SSTEP) is the part of initialization,
it is not clear why it is set in insert_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
This reverts commit ff0d05bf73.
Now that we have all the locking fixes in place, we can revert the
revert. This re-enables lockdep tracking for the console lock,
daee779718.
Signed-off-by: Dave Airlie <airlied@redhat.com>
defined(@array) is deprecated in Perl and gives off a warning.
Restructure the code to remove that warning.
[ hpa: it would be interesting to revert to the timeconst.bc script.
It appears that the failures reported by akpm during testing of
that script was due to a known broken version of make, not a problem
with bc. The Makefile rules could probably be restructured to avoid
the make bug, or it is probably old enough that it doesn't matter. ]
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
The old SRCU implementation loads sp->completed within an
RCU-sched section, courtesy of preempt_disable(). This was required
due to the use of synchronize_sched() in the old implemenation's
synchronize_srcu(). However, the new implementation does not rely
on synchronize_sched(), so it in turn does not require the load of
sp->completed and the ->c[] counter to be in a single preempt-disabled
region of code. This commit therefore moves the sp->completed access
outside of the preempt-disabled region and applies ACCESS_ONCE().
The resulting code is almost as the same as before, but it removes the
now-misleading rcu_dereference_index_check() call.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because synchronize_srcu_expedited() no longer uses
synchronize_rcu_sched_expedited(), synchronize_srcu_expedited() no longer
indirectly acquires any CPU-hotplug-related locks. This commit therefore
updates the comments accordingly.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The core of SRCU is changed, but synchronize_srcu()'s comments describe
the old algorithm. This commit therefore updates them to match the
new algorithm.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pack six lines of code into two lines.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although synchronize_srcu() can sleep, it will not sleep if the fast
path succeeds, which means that illegal use of synchronize_rcu()
might go unnoticed. This commit therefore adds might_sleep(), which
unconditionally catches illegal use of synchronize_rcu() from atomic
context.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit replaces disabling of preemption and decrement of a per-CPU
variable with this_cpu_dec(), which avoids preemption disabling on x86
and shortens the code on all platforms.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, __queue_work() chooses the pool to queue a work item to and
then determines cwq from the target wq and the chosen pool. This is a
bit backwards in that we can determine cwq first and simply use
cwq->pool. This way, we can skip get_std_worker_pool() in queueing
path which will be a hurdle when implementing custom worker pools.
Update __queue_work() such that it chooses the target cwq and then use
cwq->pool instead of the other way around. While at it, add missing
{} in an if statement.
This patch doesn't introduce any functional changes.
tj: The original patch had two get_cwq() calls - the first to
determine the pool by doing get_cwq(cpu, wq)->pool and the second
to determine the matching cwq from get_cwq(pool->cpu, wq).
Updated the function such that it chooses cwq instead of pool and
removed the second call. Rewrote the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
get_work_pool_id() currently first obtains pool using get_work_pool()
and then return pool->id. For an off-queue work item, this involves
obtaining pool ID from worker->data, performing idr_find() to find the
matching pool and then returning its pool->id which of course is the
same as the one which went into idr_find().
Just open code WORK_STRUCT_CWQ case and directly return pool ID from
work->data.
tj: The original patch dropped on-queue work item handling and renamed
the function to offq_work_pool_id(). There isn't much benefit in
doing so. Handling it only requires a single if() and we need at
least BUG_ON(), which is also a branch, even if we drop on-queue
handling. Open code WORK_STRUCT_CWQ case and keep the function in
line with get_work_pool(). Rewrote the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As nr_running is likely to be accessed from other CPUs during
try_to_wake_up(), it was kept outside worker_pool; however, while less
frequent, other fields in worker_pool are accessed from other CPUs
for, e.g., non-reentrancy check. Also, with recent pool related
changes, accessing nr_running matching the worker_pool isn't as simple
as it used to be.
Move nr_running inside worker_pool. Keep it aligned to cacheline and
define CPU pools using DEFINE_PER_CPU_SHARED_ALIGNED(). This should
give at least the same cacheline behavior.
get_pool_nr_running() is replaced with direct pool->nr_running
accesses.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a /proc/sys/kernel scheduler knob named
sched_rr_timeslice_ms that allows global changing of the
SCHED_RR timeslice value. User visable value is in milliseconds
but is stored as jiffies. Setting to 0 (zero) resets to the
default (currently 100ms).
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the sysctl-related bits from include/linux/sched.h into
a new file: include/linux/sched/sysctl.h. Then update source
files requiring access to those bits by including the new
header file.
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the recent is-work-queued-here test simplification, the nested
if() in try_to_grab_pending() can be collapsed. Collapse it.
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing. It goes like the
following.
* When a work item is queued on a pool, work->data is updated before
work->entry is linked to the pending list with a wmb() inbetween.
* When trying to determine whether a work item is currently queued on
a pool pointed to by work->data, it locks the pool and looks at
work->entry. If work->entry is linked, we then do rmb() and then
check whether work->data points to the current pool.
This works because, work->data can only point to a pool if it
currently is or were on the pool and,
* If it currently is on the pool, the tests would obviously succeed.
* It it left the pool, its work->entry was cleared under pool->lock,
so if we're seeing non-empty work->entry, it has to be from the work
item being linked on another pool. Because work->data is updated
before work->entry is linked with wmb() inbetween, work->data update
from another pool is guaranteed to be visible if we do rmb() after
seeing non-empty work->entry. So, we either see empty work->entry
or we see updated work->data pointin to another pool.
While this works, it's convoluted, to put it mildly. With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.
This patch replaces the memory barrier based "are you still here,
really?" test with much simpler "does work->data points to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.
tj: Rewrote the comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We plan to use work->data pointing to cwq as the synchronization
invariant when determining whether a given work item is on a locked
pool or not, which requires work->data pointing to cwq only while the
work item is queued on the associated pool.
With delayed_work updated not to overload work->data for target
workqueue recording, the only case where we still have off-queue
work->data pointing to cwq is try_to_grab_pending() which doesn't
update work->data after stealing a queued work item. There's no
reason for try_to_grab_pending() to not update work->data to point to
the pool instead of cwq, like the normal execution does.
This patch adds set_work_pool_and_keep_pending() which makes
work->data point to pool instead of cwq but keeps the pending bit
unlike set_work_pool_and_clear_pending() (surprise!).
After this patch, it's guaranteed that only queued work items point to
cwqs.
This patch doesn't introduce any visible behavior change.
tj: Renamed the new helper function to match
set_work_pool_and_clear_pending() and rewrote the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
To avoid executing the same work item from multiple CPUs concurrently,
a work_struct records the last pool it was on in its ->data so that,
on the next queueing, the pool can be queried to determine whether the
work item is still executing or not.
A delayed_work goes through timer before actually being queued on the
target workqueue and the timer needs to know the target workqueue and
CPU. This is currently achieved by modifying delayed_work->work.data
such that it points to the cwq which points to the target workqueue
and the last CPU the work item was on. __queue_delayed_work()
extracts the last CPU from delayed_work->work.data and then combines
it with the target workqueue to create new work.data.
The only thing this rather ugly hack achieves is encoding the target
workqueue into delayed_work->work.data without using a separate field,
which could be a trade off one can make; unfortunately, this entangles
work->data management between regular workqueue and delayed_work code
by setting cwq pointer before the work item is actually queued and
becomes a hindrance for further improvements of work->data handling.
This can be easily made sane by adding a target workqueue field to
delayed_work. While delayed_work is used widely in the kernel and
this does make it a bit larger (<5%), I think this is the right
trade-off especially given the prospect of much saner handling of
work->data which currently involves quite tricky memory barrier
dancing, and don't expect to see any measureable effect.
Add delayed_work->wq and drop the delayed_work->work.data overloading.
tj: Rewrote the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, work_busy() first tests whether the work has a pool
associated with it and if not, considers it idle. This works fine
even for delayed_work.work queued on timer, as __queue_delayed_work()
sets cwq on delayed_work.work - a queued delayed_work always has its
cwq and thus pool associated with it.
However, we're about to update delayed_work queueing and this won't
hold. Update work_busy() such that it tests WORK_STRUCT_PENDING
before the associated pool. This doesn't make any noticeable behavior
difference now.
With work_pending() test moved, the function read a lot better with
"if (!pool)" test flipped to positive. Flip it.
While at it, lose the comment about now non-existent reentrant
workqueues.
tj: Reorganized the function and rewrote the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that workqueue has moved away from gcwqs, workqueue no longer has
the need to have a CPU identifier indicating "no cpu associated" - we
now use WORK_OFFQ_POOL_NONE instead - and most uses of WORK_CPU_NONE
are gone.
The only left usage is as the end marker for for_each_*wq*()
iterators, where the name WORK_CPU_NONE is confusing w/o actual
WORK_CPU_NONE usages. Similarly, WORK_CPU_LAST which equals
WORK_CPU_NONE no longer makes sense.
Replace both WORK_CPU_NONE and LAST with WORK_CPU_END. This patch
doesn't introduce any functional difference.
tj: s/WORK_CPU_LAST/WORK_CPU_END/ and rewrote the description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
All in-kernel users of class_find_device() don't really need mutable
data for match callback.
In two places (kernel/power/suspend_test.c, drivers/scsi/osd/osd_uld.c)
this patch changes match callbacks to use const search data.
The const is propagated to rtc_class_open() and power_supply_get_by_name()
parameters.
Note that there's a dev reference leak in suspend_test.c that's not
touched in this patch.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
TCP Appropriate Byte Count was added by me, but later disabled.
There is no point in maintaining it since it is a potential source
of bugs and Linux already implements other better window protection
heuristics.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Typical cputime stats infrastructure relies on the timer tick and
its periodic polling on the CPU to account the amount of time
spent by the CPUs and the tasks per high level domains such as
userspace, kernelspace, guest, ...
Now we are preparing to implement full dynticks capability on
Linux for Real Time and HPC users who want full CPU isolation.
This feature requires a cputime accounting that doesn't depend
on the timer tick.
To implement it, this new cputime infrastructure plugs into
kernel/user/guest boundaries to take snapshots of cputime and
flush these to the stats when needed. This performs pretty
much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
and cputime snaphots are synchronized between write and read
side such that the latter can safely retrieve the pending tickless
cputime of a task and add it to its latest cputime snapshot to
return the correct result to the user.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRBsKnAAoJEIUkVEdQjox3lMgP/2R6DU2f8PyGIao3hne4M3Pu
L3q+mAG53b24Dy014KeW7gd8yv45fE7wp/rs8CGLte9VzbLkRCDSFQPgBuXVagRj
tV5nfAuqD0wHTnA+HhBE3l3C2RKAPGIu79rBpnIR/QIPPl8Z3Dby8YgmxEQKDf8G
j7MEBu2LthSuqEi2ZXemnO5r0oEnQAzAp4TTi/M38k0Fmt59nOGyjLnI+xHYCBMa
1pnz7j3jjR9NJExGu8iVvbo+jupuQngP8qmkLXHvYnj/TEJNwzO1hHVoSwOpjYpS
9ycl+T8IKQLbAkBywLtq3Mzde43xt/t8wYyGZ0oAV+Z7MIpz/9YIfDJwqQeqoNbD
dAdbNjKMbsxCgmrnyqSagfMQg/r3CPZ4vf40TMCaN4gNUJC4Ie+E4kPRKRh59+PB
Ukthmqujn0f40LAa+HXTUuzafd3b0s/ewH+8FuQ6LAG9b5+WnoN8JTJ5u6+ydokO
ZleeOowuRZZEg+abQ8Sm2GRm/BzN29gi/npb//I+ZDXWv/+3yccgsiPjCRzCAAaO
g1RmYryFSRUwHQbGNNypVWVuOLWvrBQ4jqbGO7BBuBByZMSHryKxR6mb+inH3qLE
xIDM9SdSJisc292OzoFKwVZki4MaXaadJXJduVvqYlZQvXXs7eAa4wo3euhtVITD
NLQO5OZXE4oIQmDFb0FV
=1Tzp
-----END PGP SIGNATURE-----
Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull full-dynticks (user-space execution is undisturbed and
receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready,
from Frederic Weisbecker:
"This implements the cputime accounting on full dynticks CPUs.
Typical cputime stats infrastructure relies on the timer tick and
its periodic polling on the CPU to account the amount of time
spent by the CPUs and the tasks per high level domains such as
userspace, kernelspace, guest, ...
Now we are preparing to implement full dynticks capability on
Linux for Real Time and HPC users who want full CPU isolation.
This feature requires a cputime accounting that doesn't depend
on the timer tick.
To implement it, this new cputime infrastructure plugs into
kernel/user/guest boundaries to take snapshots of cputime and
flush these to the stats when needed. This performs pretty
much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
and cputime snaphots are synchronized between write and read
side such that the latter can safely retrieve the pending tickless
cputime of a task and add it to its latest cputime snapshot to
return the correct result to the user."
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In 7b270f6099 "sched: Bail out of yield_to when source and
target runqueue has one task" we changed this to store -ESRCH so
it needs to be signed.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kbuild@01.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/20130205113751.GA20521@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_enqueue_reprogram contains a race which could result in
timer.base switch during unlock/lock sequence.
hrtimer_enqueue_reprogram is releasing the lock protecting the timer
base for calling raise_softirq_irqsoff() due to a lock ordering issue
versus rq->lock.
If during that time another CPU calls __hrtimer_start_range_ns() on
the same hrtimer, the timer base might switch, before the current CPU
can lock base->lock again and therefor the unlock_timer_base() call
will unlock the wrong lock.
[ tglx: Added comment and massaged changelog ]
Signed-off-by: Leonid Shatz <leonid.shatz@ravellosystems.com>
Signed-off-by: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1359981217-389-1-git-send-email-izik.eidus@ravellosystems.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
kernel/irq_work.c
Add support for printk in full dynticks CPU.
* Don't stop tick with irq works pending. This
fix is generally useful and concerns archs that
can't raise self IPIs.
* Flush irq works before CPU offlining.
* Introduce "lazy" irq works that can wait for the
next tick to be executed, unless it's stopped.
* Implement klogd wake up using irq work. This
removes the ad-hoc printk_tick()/printk_needs_cpu()
hooks and make it working even in dynticks mode.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
There's no reason kgdb.h itself needs to include the 8250 serial port
header file. So push it down to the _very_ limited number of individual
drivers that need the values in that file, and fix up the places where
people really wanted serial_core.h and platform_device.h.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull scheduler fixes from Ingo Molnar:
"Three small fixlets"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Fix format string for 32-bit platforms
sched: Fix warning in kernel/sched/fair.c
sched/rt: Use root_domain of rt_rq not current processor
Pull perf fixes from Ingo Molnar:
"Three fixlets and two small (and low risk) hw-enablement changes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix event group context move
x86/perf: Add IvyBridge EP support
perf/x86: Fix P6 driver section warning
arch/x86/tools/insn_sanity.c: Identify source of messages
perf/x86: Enable Intel Lincroft/Penwell/Cloverview Atom support
Pull two small RCU fixlets from Ingo Molnar.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Make rcu_nocb_poll an early_param instead of module_param
rcu: Prevent soft-lockup complaints about no-CBs CPUs
The uses of trace_clock_local() are dead code when CONFIG_RCU_TRACE=n,
but some compilers might nevertheless generate code calling this function.
This commit therefore ensures that trace_clock_local() is invoked only
when CONFIG_RCU_TRACE=y.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the previous CPU is cache affine and idle, select it.
The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.
1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs
Signed-off-by: Mike Galbraith <bitbucket@online.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1359371965.5783.127.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As no one is using the return value of irq_work_queue(),
so it is better to just make it void.
Signed-off-by: anish kumar <anish198519851985@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
[ Fix stale comments, remove now unnecessary __irq_work_queue() intermediate function ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1359925703-24304-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mips was the last architecture not using the generic variant.
Both native and compat variants switched to generic, which is
made unconditional now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
usual "call force_sigsegv or signal_delivered" logics. Takes
ksignal instead of separate siginfo/k_sigaction.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Again, protected by a temporary config symbol (GENERIC_COMPAT_RT_SIGACTION);
will be gone by the end of series.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and make it unconditional - we want the sucker on all biarch
platforms, really. All kinds of wrappers and private implementations
can go now; fortunately, they don't cause name conflicts, so we can
do that one first without any bisect hazards.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
conditional on GENERIC_COMPAT_RT_SIGQUEUEINFO; by the end of that series
it will become the same thing as COMPAT and conditional will die out.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
conditional on GENERIC_COMPAT_RT_SIGPENDING; by the end of that series
it will become the same thing as COMPAT and conditional will die out.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
conditional on GENERIC_COMPAT_RT_SIGPROCMASK; by the end of that series
it will become the same thing as COMPAT and conditional will die out.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pull compat version alongside with the native one
* make little-endian compat variant just call the native
* don't bother with separate conditional for compat (both native and
compat are going to become unconditional very soon).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Switch from __ARCH_WANT_SYS_RT_SIGACTION to opposite
(!CONFIG_ODD_RT_SIGACTION); the only two architectures that
need it are alpha and sparc. The reason for use of CONFIG_...
instead of __ARCH_... is that it's needed only kernel-side
and doing it that way avoids a mess with include order on many
architectures.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
again, strictly speaking we are in nasal daemon territory on ppc
and mips - we need to sign-extend int arguments.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Function next_prio() has been removed and pull_rt_task() is the
only user of pick_next_highest_task_rt() at the moment.
pull_rt_task is not interested in p->nr_cpus_allowed, its only
interest is the fact that cpu is allowed to execute p. If
nr_cpus_allowed == 1, cpu != task_cpu(p) and cpu is allowed then
it means that task p is in the middle of the migration
techniques; the task waits until it is moved by migration
thread. So, lets pull it earlier.
Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
CC: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/70871359644177@web16d.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we have group with mixed events (hw/sw) we want to end up
with group leader being in hw context. So if group leader is
initialy sw event, we move all the events under hw context.
The move is done for each event by removing it from its context
and adding it back into proper one. As a part of the removal the
event is automatically disabled, which is not what we want at
this stage of creating groups.
The fix is to initialize event state after removal from sw
context.
This fix resulted from the following discussion:
http://thread.gmane.org/gmane.linux.kernel.perf.user/1144
Reported-by: Andreas Hollmann <hollmann@in.tum.de>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vince@deater.net>
Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On early boot up, when the ftrace ring buffer is initialized, the
static variable current_trace is initialized to &nop_trace.
Before this initialization, current_trace is NULL and will never
become NULL again. It is always reassigned to a ftrace tracer.
Several places check if current_trace is NULL before it uses
it, and this check is frivolous, because at the point in time
when the checks are made the only way current_trace could be
NULL is if ftrace failed its allocations at boot up, and the
paths to these locations would probably not be possible.
By initializing current_trace to &nop_trace where it is declared,
current_trace will never be NULL, and we can remove all these
checks of current_trace being NULL which never needed to be
checked in the first place.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the timer broadcast mechanism is defined by a function
pointer on struct clock_event_device. As the fundamental mechanism for
broadcast is architecture-specific, this means that clock_event_device
drivers cannot be shared across multiple architectures.
This patch adds an (optional) architecture-specific function for timer
tick broadcast, allowing drivers which may require broadcast
functionality to be shared across multiple architectures.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: nico@linaro.org
Cc: Will.Deacon@arm.com
Cc: Marc.Zyngier@arm.com
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1358183124-28461-3-git-send-email-mark.rutland@arm.com
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently the broadcast mechanism used for timers is abstracted by a
function pointer on struct clock_event_device. As the fundamental
mechanism for broadcast is architecture-specific, this ties each
clock_event_device driver to a single architecture, even where the
driver is otherwise generic.
This patch adds a standard path for the receipt of timer broadcasts, so
drivers and/or architecture backends need not manage redundant lists of
timers for the purpose of routing broadcast timer ticks.
[tglx: Made the implementation depend on the config switch as well ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: nico@linaro.org
Cc: Will.Deacon@arm.com
Cc: Marc.Zyngier@arm.com
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1358183124-28461-2-git-send-email-mark.rutland@arm.com
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There are several places of consecutive calls of
dequeue_task_rt() and put_prev_task_rt() in the scheduler.
For example, function rt_mutex_setprio() does it.
The both calls lead to update_curr_rt(), the second of it
receives zeroed delta_exec. The only effective action in this
case is call of sched_rt_avg_update(), which can change
rq->age_stamp and rq->rt_avg. But it is possible in case of
""floating"" rq->clock. This fact is not reasonable to be
accounted. Another actions do nothing.
Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
CC: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/931541359550236@web1g.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Peter Anvin:
"This is a collection of miscellaneous fixes, the most important one is
the fix for the Samsung laptop bricking issue (auto-blacklisting the
samsung-laptop driver); the efi_enabled() changes you see below are
prerequisites for that fix.
The other issues fixed are booting on OLPC XO-1.5, an UV fix, NMI
debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
with I/O port references."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
samsung-laptop: Disable on EFI hardware
efi: Make 'efi_enabled' a function to query EFI facilities
smp: Fix SMP function call empty cpu mask race
x86/msr: Add capabilities check
x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
x86/olpc: Fix olpc-xo1-sci.c build errors
arch/x86/platform/uv: Fix incorrect tlb flush all issue
x86-64: Fix unwind annotations in recent NMI changes
x86-32: Start out cr0 clean, disable paging before modifying cr3/4
This reverts commit daee779718.
I'll requeue this after the console locking fixes, so lockdep
is useful again for people until fbcon is fixed.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Ftrace has a snapshot feature available from kernel space and
latency tracers (e.g. irqsoff) are using it. This patch enables
user applictions to take a snapshot via debugfs.
Add "snapshot" debugfs file in "tracing" directory.
snapshot:
This is used to take a snapshot and to read the output of the
snapshot.
# echo 1 > snapshot
This will allocate the spare buffer for snapshot (if it is
not allocated), and take a snapshot.
# cat snapshot
This will show contents of the snapshot.
# echo 0 > snapshot
This will free the snapshot if it is allocated.
Any other positive values will clear the snapshot contents if
the snapshot is allocated, or return EINVAL if it is not allocated.
Link: http://lkml.kernel.org/r/20121226025300.3252.86850.stgit@liselsia
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
[
Fixed irqsoff selftest and also a conflict with a change
that fixes the update_max_tr.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace buffer read functions use a static variable
"old_tracer" for detecting if the current tracer changes. This
was suitable for a single trace file ("trace"), but to add a
snapshot feature that will use the same function for its file,
a check against a static variable is not sufficient.
To use the output functions for two different files, instead of
storing the current tracer in a static variable, as the trace
iterator descriptor contains a pointer to the original current
tracer's name, that pointer can now be used to check if the
current tracer has changed between different reads of the trace
file.
Link: http://lkml.kernel.org/r/20121226025252.3252.9276.stgit@liselsia
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For systems with an unstable sched_clock, all cpu_clock() does is enable/
disable local irq during the call to sched_clock_cpu(). And for stable
systems they are same.
trace_clock_global() already disables interrupts, so it can call
sched_clock_cpu() directly.
Link: http://lkml.kernel.org/r/1356576585-28782-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a stat about the number of events read from the ring buffer:
# cat /debug/tracing/per_cpu/cpu0/stats
entries: 39869
overrun: 870512
commit overrun: 0
bytes: 1449912
oldest event ts: 6561.368690
now ts: 6565.246426
dropped events: 0
read events: 112 <-- Added
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
During kdump kernel's booting stage, it need to find low ram for
swiotlb buffer when system does not support intel iommu/dmar remapping.
kexed-tools is appending memmap=exactmap and range from /proc/iomem
with "Crash kernel", and that range is above 4G for 64bit after boot
protocol 2.12.
We need to add another range in /proc/iomem like "Crash kernel low",
so kexec-tools could find that info and append to kdump kernel
command line.
Try to reserve some under 4G if the normal "Crash kernel" is above 4G.
User could specify the size with crashkernel_low=XX[KMG].
-v2: fix warning that is found by Fengguang's test robot.
-v3: move out get_mem_size change to another patch, to solve compiling
warning that is found by Borislav Petkov <bp@alien8.de>
-v4: user must specify crashkernel_low if system does not support
intel or amd iommu.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-31-git-send-email-yinghai@kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Jason pointed out the HAS_PERSISTENT_CLOCK name isn't
quite accurate for the config, as some systems may have
the persistent_clock in some cases, but not always.
So change the config name to the more clear
ALWAYS_USE_PERSISTENT_CLOCK.
Signed-off-by: John Stultz <john.stultz@linaro.org>
While debugging the virtual cputime with the function graph tracer
with a max_depth of 1 (most common use of the max_depth so far),
I found that I was missing kernel execution because of a race condition.
The code for the return side of the function has a slight race:
ftrace_pop_return_trace(&trace, &ret, frame_pointer);
trace.rettime = trace_clock_local();
ftrace_graph_return(&trace);
barrier();
current->curr_ret_stack--;
The ftrace_pop_return_trace() initializes the trace structure for
the callback. The ftrace_graph_return() uses the trace structure
for its own use as that structure is on the stack and is local
to this function. Then the curr_ret_stack is decremented which
is what the trace.depth is set to.
If an interrupt comes in after the ftrace_graph_return() but
before the curr_ret_stack, then the called function will get
a depth of 2. If max_depth is set to 1 this function will be
ignored.
The problem is that the trace has already been called, and the
timestamp for that trace will not reflect the time the function
was about to re-enter userspace. Calls to the interrupt will not
be traced because the max_depth has prevented this.
To solve this issue, the ftrace_graph_return() can safely be
moved after the current->curr_ret_stack has been updated.
This way the timestamp for the return callback will reflect
the actual time.
If an interrupt comes in after the curr_ret_stack update and
ftrace_graph_return(), it will be traced. It may look a little
confusing to see it within the other function, but at least
it will not be lost.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of undercomitted scenarios, especially in large guests
yield_to overhead is significantly high. when run queue length of
source and target is one, take an opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of PLE handler.
(History: Raghavendra initially worked on break out of kvm ple handler upon
seeing source runqueue length = 1, but it had to export rq length).
Peter came up with the elegant idea of return -ESRCH in scheduler core.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Andrew Jones <drjones@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
doctorture.2013.01.11a: Changes to rcutorture and to RCU documentation.
fixes.2013.01.26a: Miscellaneous fixes.
tagcb.2013.01.24a: Tag RCU callbacks with grace-period number to
simplify callback advancement.
tiny.2013.01.29b: Enhancements to uniprocessor handling in tiny RCU.
A number of kthreads have been added to rcutorture, but the shuffler
task was not informed of them, and thus did not shuffle them. This
commit therefore adds the requisite shuffling, and, while in the area
fixes up some whitespace issues.
However, the shuffling is intended to keep randomly selected CPUs
idle, which means that the RCU priority boosting kthreads need to
avoid waking up every jiffy. This commit also makes that fix.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tiny RCU has historically omitted RCU CPU stall warnings in order to
reduce memory requirements, however, lack of these warnings caused
Thomas Gleixner some debugging pain recently. Therefore, this commit
adds RCU CPU stall warnings to tiny RCU if RCU_TRACE=y. This keeps
the memory footprint small, while still enabling CPU stall warnings
in kernels built to enable them.
Updated to include Josh Triplett's suggested use of RCU_STALL_COMMON
config variable to simplify #if expressions.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
I get the following warning every day with v3.7, once or
twice a day:
[ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()
As explained by Linus as well:
|
| Once we've done the "list_add_rcu()" to add it to the
| queue, we can have (another) IPI to the target CPU that can
| now see it and clear the mask.
|
| So by the time we get to actually send the IPI, the mask might
| have been cleared by another IPI.
|
This patch also fixes a system hang problem, if the data->cpumask
gets cleared after passing this point:
if (WARN_ONCE(!mask, "empty IPI mask"))
return;
then the problem in commit 83d349f35e ("x86: don't send an IPI to
the empty set of CPU's") will happen again.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: peterz@infradead.org
Cc: mina86@mina86.org
Cc: srivatsa.bhat@linux.vnet.ibm.com
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20130126075357.GA3205@udknight
[ Tidied up the changelog and the comment in the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While remotely reading the cputime of a task running in a
full dynticks CPU, the values stored in utime/stime fields
of struct task_struct may be stale. Its values may be those
of the last kernel <-> user transition time snapshot and
we need to add the tickless time spent since this snapshot.
To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transition and record the time / context
where we did this. Then on top of this snapshot and the current
time, perform the fixup on the reader side from task_times()
accessors.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fixed kvm module related build errors]
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Do some ground preparatory work before adding guest_enter()
and guest_exit() context tracking callbacks. Those will
be later used to read the guest cputime safely when we
run in full dynticks mode.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Allow to dynamically switch between tick and virtual based
cputime accounting. This way we can provide a kind of "on-demand"
virtual based cputime accounting. In this mode, the kernel relies
on the context tracking subsystem to dynamically probe on kernel
boundaries.
This is in preparation for being able to stop the timer tick in
more places than just the idle state. Doing so will depend on
CONFIG_VIRT_CPU_ACCOUNTING_GEN which makes it possible to account
the cputime without the tick by hooking on kernel/user boundaries.
Depending whether the tick is stopped or not, we can switch between
tick and vtime based accounting anytime in order to minimize the
overhead associated to user hooks.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
If we want to stop the tick further idle, we need to be
able to account the cputime without using the tick.
Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.
However implementing CONFIG_VIRT_CPU_ACCOUNTING require
low level hooks and involves more overhead. But we already
have a generic context tracking subsystem that is required
for RCU needs by archs which plan to shut down the tick
outside idle.
This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.
There are some upsides of doing this:
- This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (already necessary for RCU in full
tickless mode).
- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.
And one downside:
- There is probably more overhead than a native virtual based cputime
accounting. But this relies on hooks that are already set anyway.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
If the architecture doesn't provide an implementation of
nsecs_to_cputime(), the cputime accounting core uses a
default one that converts the nanoseconds to jiffies. However
this only makes sense if we use the jiffies based cputime.
For now it doesn't matter much because this API is only
called on code that uses jiffies based cputime accounting.
But the code may evolve and this API may be used more
broadly in the future. Keeping this default implementation
around is very error prone as it may introduce a bug and
hide it on architectures that don't override this API.
Fix this by moving this definition to the jiffies based
cputime headers as it is the only place where it belongs to.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
We scale stime, utime values based on rtime (sum_exec_runtime
converted to jiffies). During scaling we multiple rtime * utime,
which seems to be fine, since both values are converted to u64,
but it's not.
Let assume HZ is 1000 - 1ms tick. Process consist of 64 threads,
run for 1 day, threads utilize 100% cpu on user space. Machine
has 64 cpus.
Process rtime = utime will be 64 * 24 * 60 * 60 * 1000 jiffies,
which is 0x149970000. Multiplication rtime * utime result is
0x1a855771100000000, which can not be covered in 64 bits.
Result of overflow is stall of utime values visible in user
space (prev_utime in kernel), even if application still consume
lot of CPU time.
A solution to solve this is to perform the multiplication on
stime instead of utime. It's easy to grow the utime value fast
with a CPU bound thread in userspace for example. Now we assume
that doing so with stime is much harder. In most cases a task
shouldn't ever spend much time in kernel space as it tends to
sleep waiting for jobs completion when they take long to
achieve. IO is the typical example of that.
Hence scaling the cputime by performing the multiplication on
stime instead of utime should considerably reduce the chances of
an overflow on most workloads.
This is largely inspired by a patch from Stanislaw Gruszka:
http://lkml.kernel.org/r/20130107113144.GA7544@redhat.com
Inspired-by: Stanislaw Gruszka <sgruszka@redhat.com>
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1359217182-25184-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When I initially wrote the code for /proc/<pid>/uid_map. I was lazy
and avoided duplicate mappings by the simple expedient of ensuring the
first number in a new extent was greater than any number in the
previous extent.
Unfortunately that precludes a number of valid mappings, and someone
noticed and complained. So use a simple check to ensure that ranges
in the mapping extents don't overlap.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on it's parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.
Remove struct kref and use a plain atomic_t. Making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explict to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.
I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nesteuser namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Pointed-out-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This subsystem lacks many explanations on its purpose and
design. Add these missing comments.
v4: Document function parameter to be more kernel-doc
friendly, as per Namhyung suggestion.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
As context tracking subsystem evolved, it stopped using ignore_user_qs
and in_user defined in the rcu_dynticks structure. This commit therefore
removes them.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Small grammar fix in rcutree comment regarding 'rcu_scheduler_active'
var.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Export the context state: whether we run in user / kernel
from the context tracking subsystem point of view.
This is going to be used by the generic virtual cputime
accounting subsystem that is needed to implement the full
dynticks.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
__this_cpu_inc_return() or __this_cpu_dec generates a single instruction,
which is faster than __get_cpu_var operation.
Link: http://lkml.kernel.org/r/50A9C1BD.1060308@gmail.com
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The text in Documentation said it would be removed in 2.6.41;
the text in the Kconfig said removal in the 3.1 release. Either
way you look at it, we are well past both, so push it off a cliff.
Note that the POWER_CSTATE and the POWER_PSTATE are part of the
legacy tracing API. Remove all tracepoints which use these flags.
As can be seen from context, most already have a trace entry via
trace_cpu_idle anyways.
Also, the cpufreq/cpufreq.c PSTATE one is actually unpaired, as
compared to the CSTATE ones which all have a clear start/stop.
As part of this, the trace_power_frequency also becomes orphaned,
so it too is deleted.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There's no need to test whether a (delayed) work item is pending
before queueing, flushing or cancelling it, so remove work_pending()
tests used in those cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
9fdb04cdc5 ("async: replace list of active domains with global list
of pending items") added a struct list_head global_list in struct
async_entry, which isn't initialised. This means that if
!domain->registered at __async_schedule(), then list_del_init() will
be called on the list head in async_run_entry_fn with both pointers
NULL, causing a crash. This is fixed by initialising both the
global_list and domain_list list_heads after kzalloc'ing the entry.
This was noticed due to dapm_power_widgets() which uses
ASYNC_DOMAIN_EXCLUSIVE, which initialises the domain->registered to 0.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: James Hogan <james.hogan@imgtec.com>
Reported-by: Stephen Warren <swarren@wwwdotorg.org>
The type returned from atomic64_t can be either unsigned
long or unsigned long long, depending on the architecture.
Using a cast to unsigned long long lets us use the same
format string for all architectures.
Without this patch, building with scheduler debugging
enabled results in:
kernel/sched/debug.c: In function 'print_cfs_rq':
kernel/sched/debug.c:225:2: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long int' [-Wformat]
kernel/sched/debug.c:225:2: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Wformat]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: linux-arm-kernel@list.infradead.org
Link: http://lkml.kernel.org/r/1359123276-15833-7-git-send-email-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
a4c96ae319 "sched: Unthrottle rt runqueues in
__disable_runtime()" turned the unthrottle_offline_cfs_rqs
function into a static symbol, which now triggers a warning
about it being potentially unused:
kernel/sched/fair.c:2055:13: warning: 'unthrottle_offline_cfs_rqs' defined but not used [-Wunused-function]
Marking it __maybe_unused shuts up the gcc warning and lets the
compiler safely drop the function body when it's not being used.
To reproduce, build the ARM bcm2835_defconfig.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Boonstoppel <pboonstoppel@nvidia.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: linux-arm-kernel@list.infradead.org
Link: http://lkml.kernel.org/r/1359123276-15833-6-git-send-email-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The issue below was found in 2.6.34-rt rather than mainline rt
kernel, but the issue still exists upstream as well.
So please let me describe how it was noticed on 2.6.34-rt:
On this version, each softirq has its own thread, it means there
is at least one RT FIFO task per cpu. The priority of these
tasks is set to 49 by default. If user launches an RT FIFO task
with priority lower than 49 of softirq RT tasks, it's possible
there are two RT FIFO tasks enqueued one cpu runqueue at one
moment. By current strategy of balancing RT tasks, when it comes
to RT tasks, we really need to put them off to a CPU that they
can run on as soon as possible. Even if it means a bit of cache
line flushing, we want RT tasks to be run with the least latency.
When the user RT FIFO task which just launched before is
running, the sched timer tick of the current cpu happens. In this
tick period, the timeout value of the user RT task will be
updated once. Subsequently, we try to wake up one softirq RT
task on its local cpu. As the priority of current user RT task
is lower than the softirq RT task, the current task will be
preempted by the higher priority softirq RT task. Before
preemption, we check to see if current can readily move to a
different cpu. If so, we will reschedule to allow the RT push logic
to try to move current somewhere else. Whenever the woken
softirq RT task runs, it first tries to migrate the user FIFO RT
task over to a cpu that is running a task of lesser priority. If
migration is done, it will send a reschedule request to the found
cpu by IPI interrupt. Once the target cpu responds the IPI
interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu,
the sched timer tick of the cpu fires. So it will tick the user
RT task again. This also means the RT task timeout value will be
updated again. As the migration may be done in one tick period,
it means the user RT task timeout value will be updated twice
within one tick.
If we set a limit on the amount of cpu time for the user RT task
by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
upon reaching the soft limit.
But exactly when the SIGXCPU signal should be sent depends on the
RT task timeout value. In fact the timeout mechanism of sending
the SIGXCPU signal assumes the RT task timeout is increased once
every tick.
However, currently the timeout value may be added twice per
tick. So it results in the SIGXCPU signal being sent earlier
than expected.
To solve this issue, we prevent the timeout value from increasing
twice within one tick time by remembering the jiffies value of
last updating the timeout. As long as the RT task's jiffies is
different with the global jiffies value, we allow its timeout to
be updated.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Fan Du <fan.du@windriver.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask. This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor. Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.
This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.
This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Mike Galbraith <bitbucket@online.de>
Cc: peterz@infradead.org
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit 48ddbe1946
("cgroup: make css->refcnt clearing on cgroup removal optional"),
each css holds a ref on cgroup's dentry, so cgroup_diput() won't be
called until all css' refs go down to 0, which invalids the comments.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Free cgroup via call_rcu(). The actual work is done through
workqueue.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When destroying a cgroup, though in cgroup_diput() we've called
synchronize_rcu(), we then still have to free it via call_rcu().
The story is, long ago to fix a race between reading /proc/sched_debug
and freeing cgroup, the code was changed to utilize call_rcu(). See
commit a47295e6bc ("cgroups: make
cgroup_path() RCU-safe")
As we've fixed cpu cgroup that cpu_cgroup_offline_css() is used
to unregister a task_group so there won't be concurrent access
to this task_group after synchronize_rcu() in diput(). Now we can
just kfree(cgrp).
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A task_group won't be online (thus no one can see it) until
cpu_cgroup_css_online(), and at that time tg->css.cgroup has
been initialized, so this NULL check is redundant.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is a preparaton for later patches.
- What do we gain from cpu_cgroup_css_online():
After ss->css_alloc() and before ss->css_online(), there's a small
window that tg->css.cgroup is NULL. With this change, tg won't be seen
before ss->css_online(), where it's added to the global list, so we're
guaranteed we'll never see NULL tg->css.cgroup.
- What do we gain from cpu_cgroup_css_offline():
tg is freed via RCU, so is cgroup. Without this change, This is how
synchronization works:
cgroup_rmdir()
no ss->css_offline()
diput()
syncornize_rcu()
ss->css_free() <-- unregister tg, and free it via call_rcu()
kfree_rcu(cgroup) <-- wait possible refs to cgroup, and free cgroup
We can't just kfree(cgroup), because tg might access tg->css.cgroup.
With this change:
cgroup_rmdir()
ss->css_offline() <-- unregister tg
diput()
synchronize_rcu() <-- wait possible refs to tg and cgroup
ss->css_free() <-- free tg
kfree_rcu(cgroup) <-- free cgroup
As you see, kfree_rcu() is redundant now.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
With this change, we're guaranteed that cgroup_path() won't see NULL
cgrp->dentry, and thus we can remove the NULL check in it.
(Well, it's not strictly true, because dummptop.dentry is always NULL
but we already handle that separately.)
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We checked for uprobe==NULL earlier, no need to redo that.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1356030701-16284-22-git-send-email-sasha.levin@oracle.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Remove remaining references to gcwq.
* __next_gcwq_cpu() steals __next_wq_cpu() name. The original
__next_wq_cpu() became __next_cwq_cpu().
* s/for_each_gcwq_cpu/for_each_wq_cpu/
s/for_each_online_gcwq_cpu/for_each_online_wq_cpu/
* s/gcwq_mayday_timeout/pool_mayday_timeout/
* s/gcwq_unbind_fn/wq_unbind_fn/
* Drop references to gcwq in comments.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Rename per-cpu and unbound nr_running variables such that they match
the pool variables.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
global_cwq is now nothing but a container for per-cpu standard
worker_pools. Declare the worker pools directly as
cpu/unbound_std_worker_pools[] and remove global_cwq.
* ____cacheline_aligned_in_smp moved from global_cwq to worker_pool.
This probably would have made sense even before this change as we
want each pool to be aligned.
* get_gcwq() is replaced with std_worker_pools() which returns the
pointer to the standard pool array for a given CPU.
* __alloc_workqueue_key() updated to use get_std_worker_pool() instead
of open-coding pool determination.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
v2: Joonsoo pointed out that it'd better to align struct worker_pool
rather than the array so that every pool is aligned.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Joonsoo Kim <js1304@gmail.com>
The only remaining user of pool->gcwq is std_worker_pool_pri().
Reimplement it using get_gcwq() and remove worker_pool->gcwq.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
for_each_std_worker_pool() takes @cpu instead of @gcwq.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Instead of holding locks from both pools and then processing the pools
together, make freezing/thwaing per-pool - grab locks of one pool,
process it, release it and then proceed to the next pool.
While this patch changes processing order across pools, order within
each pool remains the same. As each pool is independent, this
shouldn't break anything.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Instead of holding locks from both pools and then processing the pools
together, make hotplug processing per-pool - grab locks of one pool,
process it, release it and then proceed to the next pool.
rebind_workers() is updated to take and process @pool instead of @gcwq
which results in a lot of de-indentation. gcwq_claim_assoc_and_lock()
and its counterpart are replaced with in-line per-pool locking.
While this patch changes processing order across pools, order within
each pool remains the same. As each pool is independent, this
shouldn't break anything.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Move gcwq->lock to pool->lock. The conversion is mostly
straight-forward. Things worth noting are
* In many places, this removes the need to use gcwq completely. pool
is used directly instead. get_std_worker_pool() is added to help
some of these conversions. This also leaves get_work_gcwq() without
any user. Removed.
* In hotplug and freezer paths, the pools belonging to a CPU are often
processed together. This patch makes those paths hold locks of all
pools, with highpri lock nested inside, to keep the conversion
straight-forward. These nested lockings will be removed by
following patches.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Move gcwq->cpu to pool->cpu. This introduces a couple places where
gcwq->pools[0].cpu is used. These will soon go away as gcwq is
further reduced.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
There's no functional necessity for the two pools on the same CPU to
share the busy hash table. It's also likely to be a bottleneck when
implementing pools with user-specified attributes.
This patch makes busy_hash per-pool. The conversion is mostly
straight-forward. Changes worth noting are,
* Large block of changes in rebind_workers() is moving the block
inside for_each_worker_pool() as now there are separate hash tables
for each pool. This changes the order of operations but doesn't
break anything.
* Thre for_each_worker_pool() loops in gcwq_unbind_fn() are combined
into one. This again changes the order of operaitons but doesn't
break anything.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, when a work item is off-queue, work->data records the CPU
it was last on, which is used to locate the last executing instance
for non-reentrance, flushing, etc.
We're in the process of removing global_cwq and making worker_pool the
top level abstraction. This patch makes work->data point to the pool
it was last associated with instead of CPU.
After the previous WORK_OFFQ_POOL_CPU and worker_poo->id additions,
the conversion is fairly straight-forward. WORK_OFFQ constants and
functions are modified to record and read back pool ID instead.
worker_pool_by_id() is added to allow looking up pool from ID.
get_work_pool() replaces get_work_gcwq(), which is reimplemented using
get_work_pool(). get_work_pool_id() replaces work_cpu().
This patch shouldn't introduce any observable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Add worker_pool->id which is allocated from worker_pool_idr. This
will be used to record the last associated worker_pool in work->data.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, when a work item is off queue, high bits of its data
encodes the last CPU it was on. This is scheduled to be changed to
pool ID, which will make it impossible to use WORK_CPU_NONE to
indicate no association.
This patch limits the number of bits which are used for off-queue cpu
number to 31 (so that the max fits in an int) and uses the highest
possible value - WORK_OFFQ_CPU_NONE - to indicate no association.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Make GCWQ_FREEZING a pool flag POOL_FREEZING. This patch doesn't
change locking - FREEZING on both pools of a CPU are set or clear
together while holding gcwq->lock. It shouldn't cause any functional
difference.
This leaves gcwq->flags w/o any flags. Removed.
While at it, convert BUG_ON()s in freeze_workqueue_begin() and
thaw_workqueues() to WARN_ON_ONCE().
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Make GCWQ_DISASSOCIATED a pool flag POOL_DISASSOCIATED. This patch
doesn't change locking - DISASSOCIATED on both pools of a CPU are set
or clear together while holding gcwq->lock. It shouldn't cause any
functional difference.
This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
There are currently two worker pools per cpu (including the unbound
cpu) and they are the only pools in use. New class of pools are
scheduled to be added and some pool related APIs will be added
inbetween. Call the existing pools the standard pools and prefix them
with std_. Do this early so that new APIs can use std_ prefix from
the beginning.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
This function no longer has any external users. Unexport it. It will
be removed later on.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
init_task.cgroups is initialized at boot phase, and whenver a ask
is forked, it's cgroups pointer is inherited from its parent, and
it's never set to NULL afterwards.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The MSI specification has several constraints in comparison with
MSI-X, most notable of them is the inability to configure MSIs
independently. As a result, it is impossible to dispatch
interrupts from different queues to different CPUs. This is
largely devalues the support of multiple MSIs in SMP systems.
Also, a necessity to allocate a contiguous block of vector
numbers for devices capable of multiple MSIs might cause a
considerable pressure on x86 interrupt vector allocator and
could lead to fragmentation of the interrupt vectors space.
This patch overcomes both drawbacks in presense of IRQ remapping
and lets devices take advantage of multiple queues and per-IRQ
affinity assignments.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c8bd86ff56b5fc118257436768aaa04489ac0a4c.1353324359.git.agordeev@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reschedule rq->curr if the first RT task has just been
pulled to the rq.
Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tkhai Kirill <tkhai@yandex.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/118761353614535@web28f.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The last remaining user was oprofile and its use has been
removed a while ago in commit bc078e4eab
("oprofile: convert oprofile from timer_hook to hrtimer").
There doesn't seem to be any upstream user of this hook
for about two years now. And I'm not even aware of any out of
tree user.
Let's remove it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1356191991-2251-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The caller of sched_sliced() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.
The change was introduced by commit:
77034937dc sched: fix crash in sys_sched_rr_get_interval()
... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
[ Since this is an ABI and an old bug, we'll test this via a
slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dan's smatch found a compare bug with the result of the
trace_test_and_set_recursion() and comparing to less than
zero. If the function fails, it returns -1, but was saved in
an unsigned int, which will never be less than zero and will
ignore the result of the test if a recursion did happen.
Luckily this is the last of the recursion tests, as the
infrastructure of ftrace would catch recursions before it
got here, except for some few exceptions.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull tracing updates from Steve Rostedt.
This commit:
tracing: Remove the extra 4 bytes of padding in events
changes the ABI. All involved parties seem to agree that it's safe to
do now, but the devil is in the details ...
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJQ+MDOAAoJEHm+PkMAQRiGFL8H/39bXl3xs/ZkBct3ciICtaAt
UeRcdjvBdUrObQmmw8FEFJA62Az74g1UYyNyc0yVJoB6DwL2oawXhGXQXqDXMBMs
xJa/dpx41gmcrvsLOJ1v2Tmokk32hgdpLNWwzcn36xZMsrhYUxjxPhVNK4aDbwrW
3uzBj83dK1rvi2PEz3hgHizRhmdD1cNgLXKrczNbT6y6rLp6Jkng2q5UhNfaZUvE
A7Ub5/nW+X6gb1PctAjulz2CoySq0FbH4viUc+zLj3NN0L9ImeJvpREauE2QPVgJ
a8Sg3E6/cL4wIXBEV1E1KapVTd1XIEGa8yB5X6q8HWNKI8nf7Kwuj65VgvwJIMc=
=+VR8
-----END PGP SIGNATURE-----
Merge tag 'v3.8-rc4' into irq/core
Merge Linux 3.8-rc4 before pulling in new commits - we were on an old v3.7 base.
Signed-off-by: Ingo Molnar <mingo@kernel.org>