
Error 7 #1

Closed
vadash opened this issue Sep 30, 2017 · 1 comment

Comments

@vadash commented Sep 30, 2017

Error 7 when installing on flo + LineageOS. The latest version, 0628, works fine.

@alesaiko (Owner) commented

This is firmware-related; the kernel itself has nothing to do with it.

alesaiko added a commit that referenced this issue Jan 28, 2018
Revert "binder: Quiet binder"

This reverts commit 390b8ce.

Change-Id: I31dcf564946247d329bf14dc2880a331b9f05716
Signed-off-by: D. Ajay Dudani <[email protected]>

introduce SIZE_MAX

ULONG_MAX is often used to check for integer overflow when calculating
allocation size.  While ULONG_MAX happens to work on most systems, there
is no guarantee that `size_t' must be the same size as `long'.

This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
portability and readability for allocation size validation.

Signed-off-by: Xi Wang <[email protected]>
Acked-by: Alex Elder <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
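
For illustration, a minimal userspace sketch of the check this enables;
alloc_array() is a hypothetical name, and in the kernel SIZE_MAX comes
from <linux/kernel.h>:

    #include <stdint.h>   /* SIZE_MAX in userspace C99 */
    #include <stdlib.h>

    static void *alloc_array(size_t n, size_t elem_size)
    {
            /* n * elem_size wraps if n > SIZE_MAX / elem_size; checking
             * against ULONG_MAX is only correct when size_t and long
             * happen to have the same width. */
            if (elem_size && n > SIZE_MAX / elem_size)
                    return NULL;
            return malloc(n * elem_size);
    }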

Staging: android: binder: Make task_get_unused_fd_flags function static

Silence the following warning:
drivers/staging/android/binder.c:368:5: warning:
symbol 'task_get_unused_fd_flags' was not declared. Should it be static?

Change-Id: Iacdae492c73d3b0399d2cf0d101943313082de0d
Cc: Arve Hjønnevåg <[email protected]>
Signed-off-by: Sachin Kamat <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

drivers: staging: android: binder.c: fix printk macros

Change printk() messages to pr_* macros.

Change-Id: Iaeddb5f0697bf25abc3d860cfdc431a0a7125d7f
Signed-off-by: Sherwin Soltani <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
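
A minimal sketch of the conversion pattern, assuming a hypothetical
report() call site rather than one taken from binder.c:

    /* pr_fmt() must be defined before the printk helpers are included */
    #define pr_fmt(fmt) "binder: " fmt
    #include <linux/printk.h>

    static void report(int err)
    {
            /* was: printk(KERN_ERR "binder: transaction failed %d\n", err); */
            pr_err("transaction failed %d\n", err);
    }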

BACKPORT: Staging: android: binder: Fixed multi-line strings

Changed all user visible multi-line strings to single line.
Removed 'binder:' prefix on strings.

Change-Id: I697fa4ee9741e2893f08062ca2256985f4977739
Signed-off-by: Anmol Sarma <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

staging: android: binder: modify struct binder_write_read to use size_t

This change mirrors the userspace operation where struct binder_write_read
members that specify the buffer size and consumed size are size_t elements.

The patch also fixes the binder_thread_write() and binder_thread_read()
functions prototypes to conform with the definition of binder_write_read.

The changes do not affect existing 32bit ABI.

Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: 397334fc2be6a7e2f77474bd2b24880efea007bf
Git-repo: https://android.googlesource.com/kernel/common/
Change-Id: If606f0fe135ffc4a630dbf34d755f559c36ee62a
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

staging: android: binder: fix sparse warnings

Fix sparse warnings by adding __user annotations to structs.

This patch fixes the following sparse warnings:

drivers/staging/android/binder.c:1343:76: warning:
	incorrect type in argument 2 (different address spaces)
	drivers/staging/android/binder.c:1343:76:
	expected void [noderef] <asn:1>*ptr
	drivers/staging/android/binder.c:1343:76: got void *binder
drivers/staging/android/binder.c:1567:57: warning:
	incorrect type in argument 2 (different address spaces)
	drivers/staging/android/binder.c:1567:57:
	expected void const [noderef] <asn:1>*from
	drivers/staging/android/binder.c:1567:57:
	got void const *buffer
drivers/staging/android/binder.c:1573:46: warning:
	incorrect type in argument 2 (different address spaces)
	drivers/staging/android/binder.c:1573:46:
	expected void const [noderef] <asn:1>*from
	drivers/staging/android/binder.c:1573:46:
	got void const *offsets
drivers/staging/android/binder.c:1603:76: warning:
	incorrect type in argument 2 (different address spaces)
	drivers/staging/android/binder.c:1603:76:
	expected void [noderef] <asn:1>*ptr
	drivers/staging/android/binder.c:1603:76: got void *binder
drivers/staging/android/binder.c:1605:64: warning:
	incorrect type in argument 2 (different address spaces)
	drivers/staging/android/binder.c:1605:64:
	expected void [noderef] <asn:1>*ptr
	drivers/staging/android/binder.c:1605:64: got void *binder
drivers/staging/android/binder.c:1605:76: warning:
	incorrect type in argument 3 (different address spaces)
	drivers/staging/android/binder.c:1605:76:
	expected void [noderef] <asn:1>*cookie
	drivers/staging/android/binder.c:1605:76: got void *cookie
drivers/staging/android/binder.c:1613:40: error:
	incompatible types in comparison

Change-Id: I6cac879e87993f077700574420a798226a21721d
Signed-off-by: Emil Goode <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
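
A minimal sketch of what the annotation buys (fetch_word() is
hypothetical): a __user pointer lives in a separate address space, so
mixing it with plain kernel pointers is exactly what sparse flags above.

    #include <linux/errno.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>

    static int fetch_word(const void __user *ubuf, u32 *out)
    {
            /* copy_from_user() returns the number of bytes NOT copied */
            if (copy_from_user(out, ubuf, sizeof(*out)))
                    return -EFAULT;
            return 0;
    }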

staging: android: Avoid using camelcase in binder.h

This changes the following:

1: BinderDriverReturnProtocol -> binder_driver_return_protocol
2: BinderDriverCommandProtocol -> binder_driver_command_protocol

These enums are not currently used, but still generate noise in checkpatch.

Well, did. They don't now :)

Change-Id: I7eeb7b8fc20ed1c4b3736f3f36b6637a1a631560
Signed-off-by: Cruz Julian Bishop <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

staging: android: binder: fix alignment issues

The Android userspace aligns the data written to the binder buffers to
4 bytes. Thus for 32bit platforms, or 64bit platforms running a 32bit
Android userspace, we can have a buffer looking like this:

platform    buffer(binder_cmd   pointer)      size
32/32                 32b         32b          8B
64/32                 32b         64b          12B
64/64                 32b         64b          12B

Thus the kernel needs to check that the buffer size is aligned to 4 bytes,
not to sizeof(void *), which is 8 bytes on 64bit machines.

The change does not affect existing 32bit ABI.

Change-Id: Idcad35da1c1567ee0676d60d03afd07b219c59ea
Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: ec35e852dc9de9809f88ff397d7a611208880f9f
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>
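
An assumed sketch of the corrected check (buffer_size_ok() is
illustrative, not the binder function):

    #include <linux/types.h>

    static bool buffer_size_ok(size_t data_size, size_t offsets_size)
    {
            /* userspace aligns binder payloads to 4 bytes on every arch,
             * so validate against sizeof(u32), not sizeof(void *) */
            return !(data_size & (sizeof(u32) - 1)) &&
                   !(offsets_size & (sizeof(u32) - 1));
    }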

Staging: android: Mark local functions in binder.c as static

This fixes the following sparse warnings
drivers/staging/android/binder.c:1703:5: warning: symbol 'binder_thread_write' was not declared. Should it be static?
drivers/staging/android/binder.c:2058:6: warning: symbol 'binder_stat_br' was not declared. Should it be static?

Change-Id: I930f10e54c19b0c6aca275f3ef51320bcfa3bb34
Signed-off-by: Bojan Prtvar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: fb07ebc3e82a98a3605112b71ea819c359549c4b
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

Staging: android: add __user annotation in binder.c

This fixes the following sparse error
drivers/staging/android/binder.c:1795:36: error: incompatible types in comparison expression (different address spaces)

Change-Id: Icf6b3868442fc3e4f145c8d13de9626abc306cd1
Signed-off-by: Bojan Prtvar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: 308fbd8ac0b0078dba29cad027e5b454aac13a6a
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

staging: android: binder: replace IOCTL types with user-exportable types

This patch modifies the IOCTL macros to use user-exportable data types,
as they are the preferred types for the user/kernel interface.

The patch does not change in any way the functionality of the binder driver.

Change-Id: I784358581eba5c04c9bb3235cd4ae68f0225129a
Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

staging: android: binder: fix BINDER_SET_MAX_THREADS declaration

This change will fix the BINDER_SET_MAX_THREADS ioctl to use __u32
instead of size_t for setting the max threads. Thus using the same
handler for 32 and 64bit kernels.

This value is stored internally in struct binder_proc and set to 15
on open_binder() in the libbinder API (thus there is no need for a 64bit
size_t on 64bit platforms).

The change does not affect existing 32bit ABI.

Change-Id: I193678d455b6527d54c524feb785631df8faed5a
Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: a9350fc859ae3f1db7f2efd55c7a3e0d09a4098d
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

staging: android: binder: fix BC_FREE_BUFFER ioctl declaration

BinderDriverCommands mirror the ioctl usage. Thus the size of the
structure passed through the interface should be used to generate the
ioctl number.

The change reflects the type being passed from user space: a pointer
to a binder_buffer. This change should not affect the existing 32bit
user space, since BC_FREE_BUFFER is computed as:

   #define _IOW(type,nr,size)          \
      (((type) << _IOC_TYPESHIFT) |    \
       ((nr)   << _IOC_NRSHIFT)   |    \
       ((size) << _IOC_SIZESHIFT))

and for a 32bit compiler BC_FREE_BUFFER will have the same computed
value. This change will also ease our work in differentiating
BC_FREE_BUFFER from COMPAT_BC_FREE_BUFFER.

The change does not affect existing 32bit ABI.

Change-Id: I72c6bfae325840a825c8786a79a07ffad540d602
Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: fc56f2ecf091d774c18ad0d470c62c6818fa32a3
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>
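
A small userspace demonstration of the size-encoding point (the 'b'/3
command code is made up; only sizeof(arg) matters): both definitions
produce the same value with a 32bit compiler and diverge on LP64.

    #include <stdio.h>
    #include <sys/ioctl.h>           /* _IOW() */

    struct binder_buffer;            /* opaque: only the pointer size matters */

    #define FREE_INT  _IOW('b', 3, int)
    #define FREE_PTR  _IOW('b', 3, struct binder_buffer *)

    int main(void)
    {
            /* identical on ILP32 (both 4 bytes), different on LP64 */
            printf("int arg: %#lx  ptr arg: %#lx\n",
                   (unsigned long)FREE_INT, (unsigned long)FREE_PTR);
            return 0;
    }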

staging: android: binder: replace types with portable ones

Since this driver is meant to be used on different types of processors
and a portable driver should specify the size a variable expects to be
this patch changes the types used throughout the binder interface.

We use "userspace" types since this header will be exported and used by
the Android filesystem.

The patch does not change in any way the functionality of the binder driver.

Change-Id: Iede6575f6f9d76bec0bbed11948abe3ff081d0ee
Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: eecddef594f9eb159040160b929642f16a07387f
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

staging: android: binder: fix binder interface for 64bit compat layer

The changes in this patch will fix the binder interface for use on 64bit
machines and stand as the base of the 64bit compat support. The changes
apply to the structures that are passed between the kernel and
userspace.

Most of the changes applied mirror the change to struct binder_version,
where there is no need for a 64bit wide protocol_version (on 64bit
machines). The change is in line with the existing 32bit userspace (the
structure has the same size) and simplifies the compat layer such that
the same handler can service the BINDER_VERSION ioctl.

Other changes make use of kernel types as well as user-exportable ones
and fix format specifier issues.

The changes do not affect existing 32bit ABI.

Change-Id: If00cb82dc4407a5e0890abbcb4019883e99e9a1f
Signed-off-by: Serban Constantinescu <[email protected]>
Acked-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: 64dcfe6b84d4104d93e4baf2b5a0b3e7f2e4cc30
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

drivers: staging: android: split uapi out of binder.h

Move the userspace interface of binder.h to
drivers/staging/android/uapi/binder.h.

Change-Id: I2e56ba89ade5e1f33b121e6ecd456392d588a14e
Signed-off-by: Colin Cross <[email protected]>
Git-commit: 06f505a4d5719da00e76bde885792a7d5ec968f8
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

staging: android: binder: fix ABI for 64bit Android

This patch fixes the ABI for 64bit Android userspace.
BC_REQUEST_DEATH_NOTIFICATION and BC_CLEAR_DEATH_NOTIFICATION claim
to be using struct binder_ptr_cookie, but they are using a 32bit handle
and a pointer.

On 32bit systems the payload size is the same as the size of struct
binder_ptr_cookie, however for 64bit systems this will differ. This
patch adds struct binder_handle_cookie that fixes this issue for 64bit
Android.

Since there are no 64bit users of this interface that we know of, this
change should not affect any existing systems.

Change-Id: I8909cbc50aad48ccf371270bad6f69ff242a8c22
Signed-off-by: Serban Constantinescu <[email protected]>
Git-commit: 34d977e7af9bb097530aa71204d591485f7dddc7
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: David Ng <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>
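
Per the upstream commit, the added struct pins the payload layout on
both word sizes (binder_uintptr_t comes from the 64bit-protocol patch
further below):

    struct binder_handle_cookie {
            __u32 handle;              /* 32bit handle on every arch */
            binder_uintptr_t cookie;
    } __packed;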

binder: prevent kptr leak by using %pK format specifier

Works in conjunction with kptr_restrict.
Bug: 30143283

Change-Id: I2b3ce22f4e206e74614d51453a1d59b7080ab05a
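
A usage sketch (dump_buf() is hypothetical): with kptr_restrict set,
%pK prints an obfuscated value where plain %p would leak the address.

    #include <linux/printk.h>

    static void dump_buf(const void *buffer, size_t size)
    {
            /* %pK honors the kptr_restrict sysctl; %p would not */
            pr_info("buffer at %pK, size %zu\n", buffer, size);
    }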

staging: binder: add vm_fault handler

An issue was observed when a userspace task exits: the page which hits
the error here is the zero page. In binder mmap, the whole of the vma is
not mapped. On a task crash, when debuggerd reads the binder regions,
the unmapped areas fall through to do_anonymous_page in handle_pte_fault,
due to the absence of a vm_fault handler. This results in the
zero page being mapped. Later, in zap_pte_range, vm_normal_page
returns the zero page in the case of VM_MIXEDMAP, and this results in the
error.

BUG: Bad page map in process mediaserver  pte:9dff379f pmd:9bfbd831
page:c0ed8e60 count:1 mapcount:-1 mapping:  (null) index:0x0
page flags: 0x404(referenced|reserved)
addr:40c3f000 vm_flags:10220051 anon_vma:  (null) mapping:d9fe0764 index:fd
vma->vm_ops->fault:   (null)
vma->vm_file->f_op->mmap: binder_mmap+0x0/0x274
CPU: 0 PID: 1463 Comm: mediaserver Tainted: G        W    3.10.17+ #1
[<c001549c>] (unwind_backtrace+0x0/0x11c) from [<c001200c>] (show_stack+0x10/0x14)
[<c001200c>] (show_stack+0x10/0x14) from [<c0103d78>] (print_bad_pte+0x158/0x190)
[<c0103d78>] (print_bad_pte+0x158/0x190) from [<c01055f0>] (unmap_single_vma+0x2e4/0x598)
[<c01055f0>] (unmap_single_vma+0x2e4/0x598) from [<c010618c>] (unmap_vmas+0x34/0x50)
[<c010618c>] (unmap_vmas+0x34/0x50) from [<c010a9e4>] (exit_mmap+0xc8/0x1e8)
[<c010a9e4>] (exit_mmap+0xc8/0x1e8) from [<c00520f0>] (mmput+0x54/0xd0)
[<c00520f0>] (mmput+0x54/0xd0) from [<c005972c>] (do_exit+0x360/0x990)
[<c005972c>] (do_exit+0x360/0x990) from [<c0059ef0>] (do_group_exit+0x84/0xc0)
[<c0059ef0>] (do_group_exit+0x84/0xc0) from [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548)
[<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) from [<c0011500>] (do_signal+0xa8/0x3b8)

Add a vm_fault handler which returns VM_FAULT_SIGBUS, and prevents the
wrong fallback to do_anonymous_page.

CRs-Fixed: 673147
Change-Id: I43730a51b6c819538b46c5e4dc5c96c8a384098d
Signed-off-by: Vinayak Menon <[email protected]>
Patch-mainline: linux-arm-kernel @ 06/02/14, 18:17
Signed-off-by: Vignesh Radhakrishnan <[email protected]>
Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
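
A sketch of the added handler, assuming the 3.x-era fault signature:

    #include <linux/mm.h>

    static int binder_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            /* unmapped binder pages must never fall back to the zero page */
            return VM_FAULT_SIGBUS;
    }

    static const struct vm_operations_struct binder_vm_ops = {
            .fault = binder_vm_fault,
            /* .open/.close elided */
    };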

Staging: android: binder: Support concurrent 32 bit and 64 bit processes.

Add binder_size_t and binder_uintptr_t, which are used instead of size_t
and void __user * in the user-space interface.

Use 64 bit pointers on all systems unless CONFIG_ANDROID_BINDER_IPC_32BIT
is set (which enables the old protocol on 32 bit systems).

Change BINDER_CURRENT_PROTOCOL_VERSION to 8 if
CONFIG_ANDROID_BINDER_IPC_32BIT is not set.

Add compat ioctl.

Change-Id: Ifbbde0209da0050011bcab34c547a4c30d6e8c49
Signed-off-by: Arve Hjønnevåg <[email protected]>
Git-commit: 1c4aa9fb12e8b0a54f056b8402b0bde61b49498f
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: David Ng <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>
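
The uapi typedefs this introduces are essentially (shape as merged
upstream):

    #include <linux/types.h>

    #ifdef BINDER_IPC_32BIT
    typedef __u32 binder_size_t;
    typedef __u32 binder_uintptr_t;
    #else
    typedef __u64 binder_size_t;
    typedef __u64 binder_uintptr_t;
    #endif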

binder: Quiet Binder

Temporary change to avoid watchdog bark because of
excessive failed transaction logging

CRs-Fixed: 572081

Change-Id: Id664d65ab9e78627991f8b7d4f4e5e126908c214
Signed-off-by: Uma Maheshwari Bhiram <[email protected]>

binder: NULL pointer reference

This fix handles a possible NULL pointer dereference in a
debug message.

CRs-fixed: 642883
Change-Id: Ide78f281ec0cff5cbd8231b85c305d13a892854e
Signed-off-by: Paresh Nakhe <[email protected]>

Staging: android: binder: More offset validation.

Make sure offsets don't point to overlapping flat_binder_object
structs.

Change-Id: I425ab0c46fbe2b00ed679c5becf9e8140395eb40
Signed-off-by: Arve Hjønnevåg <[email protected]>
Git-commit: 457c3cd05958b8397211ae1f6dd3c3d325f4c0ea
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

drivers: android: binder: Fix code style in binder_deferred_release

 * Use tabs where applicable
 * Remove a few "80-columns" checkpatch warnings
 * Separate code paths with empty lines for readability

Change-Id: I634852d0812756e2c0412152a36c99dd9a9bb94a
Signed-off-by: Mirsal Ennaime <[email protected]>
Reviewed-by: Dan Carpenter <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

drivers: android: binder: Remove excessive indentation

Remove one level of indentation from the binder proc page release code
by using slightly different control semantics.

Change-Id: I7a34049bf32799d7954da770f05411183c950778
Signed-off-by: Mirsal Ennaime <[email protected]>
Reviewed-by: Dan Carpenter <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

drivers: android: binder: Use __func__ in debug messages

Debug messages sent in binder_deferred_release begin with
"binder_release:" which is a bit misleading as binder_release is not
directly part of the call stack. Use __func__ instead for debug messages
in binder_deferred_release.

Change-Id: I7b9e2efaed188328d5b0dc82fbfe314a3666237c
Signed-off-by: Mirsal Ennaime <[email protected]>
Reviewed-by: Dan Carpenter <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

staging: android: Fix typo in staging/android

Fix "with with" in debug message.

Change-Id: Ibb60ca741d8ec760873054db53ad83e1b8a70c15
Signed-off-by: Masanari Iida <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Git-commit: 1dcdbfd6d9a5172ece7ccccbca90531d4cf62083
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Neeti Desai <[email protected]>
Signed-off-by: Ajay Dudani <[email protected]>

binder: blacklist %p kptr_restrict

Bug: 31495231
Change-Id: Iebc150f6bc939b56e021424ee44fb30ce8d732fd

android: binder: split flat_binder_object.

flat_binder_object is used for both handling
binder objects and file descriptors, even though
the two are mostly independent. Since we'll
have more fixup objects in binder in the future,
instead of extending flat_binder_object again,
split out file descriptors to their own object
while retaining backwards compatibility to
existing user-space clients. All binder objects
just share a header.

Change-Id: Ifffa8cb749335d0ee79226c98f70786190516355
Signed-off-by: Martijn Coenen <[email protected]>

BACKPORT: android: binder: support multiple context managers.

Move the context manager state into a separate
struct context, and allow for each process to have
its own context associated with it.

Change-Id: I6a9dfacb7b73a29760e367ff0b4e0ee21f2d0380
Signed-off-by: Martijn Coenen <[email protected]>
Signed-off-by: Adrian DC <[email protected]>

BACKPORT: android: binder: deal with contexts in debugfs.

Properly print the context in debugfs entries.

Change-Id: Ieeb89bfa8e760635366ce8b60569fbbd4937b844
Signed-off-by: Martijn Coenen <[email protected]>

BACKPORT: android: binder: support multiple /dev instances.

Add a new module parameter 'devices', that can be
used to specify the names of the binder device
nodes we want to populate in /dev.

Each device node has its own context manager, and
is therefore logically separated from all the other
device nodes.

The config option CONFIG_ANDROID_BINDER_DEVICES can
be used to set the default value of the parameter.

This approach was favored over using IPC namespaces,
mostly because we require a single process to be a
part of multiple binder contexts, which seemed harder
to achieve with namespaces.

[AdrianDC] Backport to 3.4 with list.h iterator
           hlist_for_each_entry_rcu node kept

Change-Id: I3d8531c44e82ef7db4d8b9fa0c1761d4ec282e3d
Signed-off-by: Martijn Coenen <[email protected]>
Signed-off-by: Adrian DC <[email protected]>
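
A sketch of how the parameter is wired up, based on the upstream change;
init then splits the string with strsep() and registers one device per
name:

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* default comes from CONFIG_ANDROID_BINDER_DEVICES,
     * e.g. "binder,hwbinder,vndbinder" */
    static char *binder_devices_param = CONFIG_ANDROID_BINDER_DEVICES;
    module_param_named(devices, binder_devices_param, charp, 0444);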

android: binder: refactor binder_transact()

Moved handling of fixup for binder objects,
handles and file descriptors into separate
functions.

Change-Id: If6849f1caee3834aa87d0ab08950bb1e21ec6e38
Signed-off-by: Martijn Coenen <[email protected]>

android: binder: add extra size to allocator.

The binder_buffer allocator currently only allocates
space for the data and offsets buffers of a Parcel.
This change allows for requesting an additional chunk
of data in the buffer, which can for example be used
to hold additional meta-data about the transaction
(eg a security context).

Change-Id: I58ab9c383a2e1a3057aae6adaa596ce867f1b157
Signed-off-by: Martijn Coenen <[email protected]>

android: binder: support for scatter-gather.

Previously all data passed over binder needed
to be serialized, with the exception of Binder
objects and file descriptors.

This patch adds support for scatter-gathering raw
memory buffers into a binder transaction, avoiding
the need to first serialize them into a Parcel.

To remain backwards compatible with existing
binder clients, it introduces two new command
ioctls for this purpose - BC_TRANSACTION_SG and
BC_REPLY_SG. These commands may only be used with
the new binder_transaction_data_sg structure,
which adds a field for the total size of the
buffers we are scatter-gathering.

Because memory buffers may contain pointers to
other buffers, we allow callers to specify
a parent buffer and an offset into it, to indicate
this is a location pointing to the buffer that
we are fixing up. The kernel will then take care
of fixing up the pointer to that buffer as well.

Change-Id: Ibe17f4f5629d1d541a03f1e826cfd153b64f0d8c
Signed-off-by: Martijn Coenen <[email protected]>
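
Per the upstream patch, the new command payload is the existing
transaction struct plus one field for the total scatter-gather size:

    struct binder_transaction_data_sg {
            struct binder_transaction_data transaction_data;
            binder_size_t buffers_size;   /* total size of all sg buffers */
    };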

android: binder: support for file-descriptor arrays.

This patch introduces a new binder_fd_array object,
that allows us to support one or more file descriptors
embedded in a buffer that is scatter-gathered.

Change-Id: I647a53cf0d905c7be0dfd9333806982def68dd74
Signed-off-by: Martijn Coenen <[email protected]>

binder: don't allow mmap() by process other than proc->tsk

we really shouldn't do get_files_struct() on a different process
and use it to modify the sucker later on.

Change-Id: I2be2b99395b6efa85a007317b25e6e9e7953c47a
Signed-off-by: Al Viro <[email protected]>

binder: use group leader instead of open thread

The binder allocator assumes that the thread that
called binder_open will never die for the lifetime of
that proc. That thread is normally the group_leader,
however it may not be. Use the group_leader instead
of current.

Bug: 35707103
Test: Created test case to open with temporary thread
Change-Id: Id693f74b3591f3524a8c6e9508e70f3e5a80c588
Signed-off-by: Todd Kjos <[email protected]>

ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES.

These will be required going forward.

Change-Id: I8f24e1e9f87a6773bd84fb9f173a3725c376c692
Signed-off-by: Martijn Coenen <[email protected]>

android: binder: add padding to binder_fd_array_object.

binder_fd_array_object starts with a 4-byte header,
followed by a few fields that are 8 bytes when
ANDROID_BINDER_IPC_32BIT=N.

This can cause alignment issues in a 64-bit kernel
with a 32-bit userspace, as on x86_32 an 8-byte primitive
may be aligned to a 4-byte address. Pad with a __u32
to fix this.

Change-Id: I4374ed2cc3ccd3c6a1474cb7209b53ebfd91077b
Signed-off-by: Martijn Coenen <[email protected]>
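
The padded layout, as merged upstream:

    struct binder_fd_array_object {
            struct binder_object_header hdr;   /* 4-byte header */
            __u32 pad;                         /* keeps num_fds 8-byte aligned */
            binder_size_t num_fds;
            binder_size_t parent;
            binder_size_t parent_offset;
    };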

BACKPORT: ARM: 8091/2: add get_user() support for 8 byte types

Recent contributions, including to DRM and binder, introduce 64-bit
values in their interfaces. A common motivation for this is to allow
the same ABI for 32- and 64-bit userspaces (and therefore also a shared
ABI for 32/64 hybrid userspaces). Anyhow, the developers would like to
avoid gotchas like having to use copy_from_user().

This feature is already implemented on x86-32 and the majority of other
32-bit architectures. The current list of get_user_8 hold-out
architectures is: arm, avr32, blackfin, m32r, metag, microblaze,
mn10300, sh.

Credit:

    My name sits rather uneasily at the top of this patch. The v1 and
    v2 versions of the patch were written by Rob Clark and to produce v4
    I mostly copied code from Russell King and H. Peter Anvin. However I
    have mangled the patch sufficiently that *blame* is rightfully mine
    even if credit should more widely shared.

Changelog:

v5: updated to use the ret macro (requested by Russell King)
v4: remove an inlined add on big endian systems (spotted by Russell King),
    used __ARMEB__ rather than BIG_ENDIAN (to match rest of file),
    cleared r3 on EFAULT during __get_user_8.
v3: fix a couple of checkpatch issues
v2: pass correct size to check_uaccess, and better handling of narrowing
    double word read with __get_user_xb() (Russell King's suggestion)
v1: original

Change-Id: I41787d73f0844c15b6bd0424a5f83cafaba8b508
Signed-off-by: Rob Clark <[email protected]>
Signed-off-by: Daniel Thompson <[email protected]>
Signed-off-by: Russell King <[email protected]>
[flex1911: backport to 3.4: use "mov pc" instruction instead of
           nonexistent here "ret" macro]
Signed-off-by: Artem Borisov <[email protected]>
Signed-off-by: D. Andrei Măceș <[email protected]>
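
What the feature permits at call sites, as a sketch (read_cookie() is
hypothetical):

    #include <linux/uaccess.h>

    static int read_cookie(const u64 __user *uptr, u64 *out)
    {
            /* a single get_user() on a 64-bit value, instead of a
             * copy_from_user() round trip; needs __get_user_8 on ARM */
            return get_user(*out, uptr);
    }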

staging: binder: Improve Kconfig entry for ANDROID_BINDER_IPC_32BIT

Add a more clear explanation of the option in the prompt, and
make the config depend on ANDROID_BINDER_IPC being selected.

Also sets the default to y, which matches AOSP.

Change-Id: Ib85275f415a3a10b3b40e0f847cb43fce69c203b
Cc: Colin Cross <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Cc: Serban Constantinescu <[email protected]>
Cc: Android Kernel Team <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: D. Andrei Măceș <[email protected]>

UPSTREAM: drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE

One point was missed in commit da49889deb34 ("staging:
binder: Support concurrent 32 bit and 64 bit processes."): when
BINDER_IPC_32BIT is configured, the size of binder_uintptr_t is 32 bits,
but the size of void * is 64 bits on a 64-bit system. Correct it here.

Signed-off-by: Lisa Du <[email protected]>
Signed-off-by: Nicolas Boichat <[email protected]>
Fixes: da49889deb34 ("staging: binder: Support concurrent 32 bit and 64 bit processes.")
Cc: <[email protected]>
Acked-by: Olof Johansson <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
(cherry picked from commit 7a64cd887fdb97f074c3fda03bee0bfb9faceac3)
[drinkcat: upstream moved drivers/staging/android to drivers/android]

BUG=b:26833439
TEST=See b:26833439 comment #22

Signed-off-by: Nicolas Boichat <[email protected]>
Change-Id: I4a3fd73fea1cbd5c8ea0a9f33c17ef25006ad0dd
Signed-off-by: D. Andrei Măceș <[email protected]>

ANDROID: binder: fix compilation warnings.

This fixes compilation warnings seen when using the 64-bit binder
interface on 32-bit kernels. The casts are safe, because they are
buffers we have previously successfully called copy_from_user() on.

Change-Id: Ifa4ef2f99b28202d3d1bd1417e441266a2609cec
Signed-off-by: Martijn Coenen <[email protected]>
Signed-off-by: D. Andrei Măceș <[email protected]>
alesaiko pushed a commit that referenced this issue Mar 2, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. However, the feature has too much
of a negative impact on other workloads to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c
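
An assumed sketch of the knob's wiring; the actual table entry is not
shown in this log, so the names follow the commit message:

    #include <linux/sysctl.h>

    unsigned int sysctl_sched_wake_to_idle;   /* nonzero: all tasks wake to idle */

    static struct ctl_table wake_to_idle_entry = {
            .procname     = "sched_wake_to_idle",
            .data         = &sysctl_sched_wake_to_idle,
            .maxlen       = sizeof(unsigned int),
            .mode         = 0644,
            .proc_handler = proc_dointvec,
    };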

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would mean the nohz idle balancer is
not triggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
which aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total number of SCHED_FAIR tasks a cpu has across its
various cfs_rqs, rather than on the number of tasks in a particular
cfs_rq (as implemented currently). As a result, with the current code,
it is possible for a running task (which is the sole task in its cfs_rq)
to be preempted much after its ideal_runtime has elapsed, resulting in
increased latency for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark schedule_io_timeout() with EXPORT_SYMBOL

Make schedule_io_timeout() visible to modules.

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in the sched_slice() calculation by using the
wrong cfs_rq for tasks. rq->cfs was incorrectly used as a task's cfs_rq,
rather than the correct one to which the task belonged.

Fix the bug by using the correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it go idle. However that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it
is attached to a new sched domain hierarchy. This should let cpus run
their load balance checks at the right time!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2, thanks to
another part of the kernel, to come back and clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked.  There's no reason to trigger a BUG while holding the rq
lock, crashing the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
un-noticed because there was an earlier instance that occured long ago,
change to WARN_ON(). If there ever is a flood of these there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit
the BUG_ON at line 689 of rt.c during cpu hotplug while there are
realtime threads running, because runtime is enabled twice while the
rt_runtime may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm ple
 handler upon seeing source runqueue length = 1, but it had to export rq
 length. Peter came up with the elegant idea of returning -ESRCH in the
 scheduler core.)

Signed-off-by: Peter Zijlstra <[email protected]>
[Raghavendra: added a check on the rq length of the target vcpu (thanks Avi)]
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0,
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line,
if total is a nonzero number but has its lower 32 bits zeroed. Removing
the casting is not a good solution, since some do_div() implementations
do cast to u32 internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
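
A userspace reproduction of the failure mode: a nonzero 64bit total
whose low 32 bits are zero becomes a zero divisor after the u32 cast.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t total = 0x100000000ULL;    /* nonzero, low 32 bits zero */
            uint32_t divisor = (uint32_t)total; /* == 0: do_div() would trap */
            printf("divisor = %u\n", divisor);
            return 0;
    }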

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_remote_clock(). The time jump gets propagated to CPU0 via
sched_remote_clock() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
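
The 32bit read side of the fix, per the upstream commit, abuses
cmpxchg64() as an atomic 64-bit load:

    /* cmpxchg64(ptr, old, new) returns the current value atomically;
     * with old == new == 0 it only ever writes 0 over an existing 0,
     * so it acts as an atomic 64-bit read of the remote clock. */
    remote_clock = cmpxchg64(&scd->clock, 0, 0);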

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all of the
rq's RT tasks (no other task has the same priority) is waking on a
throttled rt_rq. The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If a user launches an RT FIFO task with a priority lower than
the softirq RT tasks' 49, it's possible to have two RT FIFO tasks enqueued on
one cpu runqueue at the same moment. By the current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which was just launched is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick time by remembering the jiffies value of last updating the
timeout. As long as the RT task's jiffies is different from the global
jiffies value, we allow its timeout to be updated.
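
The fix amounts to a per-task stamp checked in the RT watchdog; a sketch,
where rt.watchdog_stamp is the field the patch adds:

  static void watchdog(struct rq *rq, struct task_struct *p)
  {
      unsigned long soft, hard;

      /* max may change after cur was read, this will be fixed next tick */
      soft = task_rlimit(p, RLIMIT_RTTIME);
      hard = task_rlimit_max(p, RLIMIT_RTTIME);

      if (soft != RLIM_INFINITY) {
          unsigned long next;

          /* count the timeout at most once per jiffy */
          if (p->rt.watchdog_stamp != jiffies) {
              p->rt.timeout++;
              p->rt.watchdog_stamp = jiffies;
          }

          next = DIV_ROUND_UP(min(soft, hard), USEC_PER_SEC/HZ);
          if (p->rt.timeout > next)
              p->cputime_expires.sched_exp = p->se.sum_exec_runtime;
      }
  }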

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevent futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.
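
A sketch of the check, as added to the pull side (pull_rt_task() iterates
the overloaded CPUs in this kernel; the comment is from the upstream change):

      for_each_cpu(cpu, this_rq->rd->rto_mask) {
          if (this_cpu == cpu)
              continue;

          src_rq = cpu_rq(cpu);

          /*
           * Don't bother taking the src_rq->lock if the next highest
           * task is known to be lower-priority than our current task.
           * This may look racy, but if this value is about to go
           * logically higher, the src_rq will push this task away.
           * And if its going logically lower, we do not care.
           */
          if (src_rq->rt.highest_prio.next >=
              this_rq->rt.highest_prio.curr)
              continue;
          /* ... */
      }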

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson<[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.
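
The reset lands in rt_mutex_setprio(); a sketch of the deboost path:

      if (rt_prio(prio))
          p->sched_class = &rt_sched_class;
      else {
          if (rt_prio(oldprio))
              p->rt.timeout = 0;    /* un-PI-boosting to a non-RT class */
          p->sched_class = &fair_sched_class;
      }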

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
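
In a 3.4-era __sched_setscheduler() the enqueue side can be sketched as:

      if (on_rq) {
          /*
           * Enqueue to the head of the prio bucket when the task's
           * priority was lowered or unchanged, to the tail when it
           * was raised (user-space view of priority).
           */
          enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);
      }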

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock, now that try_to_wake_up() takes another lock we
need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually one-liner, it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
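
The helper and its use in schedule(), sketched from the upstream change:

  /*
   * Despite its name it doesn't necessarily have to be a full barrier.
   * It should only guarantee that a STORE before the critical section
   * can not be reordered with a LOAD inside this section.
   */
  #ifndef smp_mb__before_spinlock
  #define smp_mb__before_spinlock()   smp_wmb()
  #endif

      /* in __schedule(): */
      smp_mb__before_spinlock();
      raw_spin_lock_irq(&rq->lock);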

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then it checks if a task should be pulled in the current
runqueue in idle_balance() assuming it will go to idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go to idle but
we have filled the idle_stamp, thinking we will.

This patch closes the window by checking if the runqueue has been modified
but without pulling a task after taking the lock again, so we won't go to idle
right after in the __schedule() function.
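
A sketch of the re-check after the lock is re-taken, using the fair-class
names of this era:

      raw_spin_lock(&this_rq->lock);

      /*
       * While browsing the domains, we released the rq lock, a task
       * could have been enqueued in the meantime. Since we're not
       * going idle, pretend we pulled a task.
       */
      if (this_rq->cfs.h_nr_running && !pulled_task)
          pulled_task = 1;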

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and it's using possible, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.
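
Taken together, the two load-avg hotplug changes fold the delta once per
dead CPU; a sketch:

  static void calc_load_migrate(struct rq *rq)
  {
      long delta = calc_load_fold_active(rq);

      if (delta)
          atomic_long_add(delta, &calc_load_tasks);
  }

      /* in migration_call(): */
      case CPU_DEAD:
      case CPU_DEAD_FROZEN:
          calc_load_migrate(rq);    /* nr_running is 0 by now */
          break;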

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.
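
The rewrite centers on folding pending deltas into a flip-flopped idle
index; a sketch of the NO_HZ entry side, following the upstream commit:

  void calc_load_enter_idle(void)
  {
      struct rq *this_rq = this_rq();
      long delta;

      /*
       * We're going into NO_HZ mode; if there's any pending delta,
       * fold it into the pending idle delta.
       */
      delta = calc_load_fold_active(this_rq);
      if (delta) {
          int idx = calc_load_write_idx();
          atomic_long_add(delta, &calc_load_idle[idx]);
      }
  }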

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
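
So the corrected call site, sketched:

      /* in get_rr_interval_fair(): 0 timeslice on an idle runqueue */
      if (rq->cfs.load.weight)
          rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(&p->se),
                                                  &p->se));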

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.
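
A sketch of the extra pointer and the accessor that now reads it:

  struct task_struct {
      /* ... */
  #ifdef CONFIG_CGROUP_SCHED
      /* kept up to date under scheduler locks by sched_move_task() */
      struct task_group *sched_task_group;
  #endif
      /* ... */
  };

  static inline struct task_group *task_group(struct task_struct *p)
  {
      return p->sched_task_group;
  }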

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (i.e., restores the cpusets to exactly what they were before suspend, by
not touching them at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.
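
A sketch of how the hotplug notifier could distinguish the frozen
(suspend/resume) case, loosely following the description above; the
num_cpus_frozen counter is what the change introduces:

  static int cpuset_cpu_active(struct notifier_block *nfb,
                               unsigned long action, void *hcpu)
  {
      switch (action) {
      case CPU_ONLINE_FROZEN:
          /*
           * Resume path: while CPUs are still frozen, keep one
           * all-encompassing sched domain and leave cpusets alone.
           * Only the last online operation rebuilds the domains
           * from the (unaltered) cpusets.
           */
          num_cpus_frozen--;
          if (likely(num_cpus_frozen)) {
              partition_sched_domains(1, NULL, NULL);
              break;
          }
          /* fall through: last CPU coming back online */
      case CPU_ONLINE:
          cpuset_update_active_cpus();
          break;
      default:
          return NOTIFY_DONE;
      }
      return NOTIFY_OK;
  }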

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it is the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
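
The fix is to unconditionally refresh the child's se.parent,cfs_rq in
task_fork_fair(); a sketch:

  static void task_fork_fair(struct task_struct *p)
  {
      /* ... */
      /*
       * Not only the cpu but also the task_group of the parent might
       * have been changed after parent->se.parent,cfs_rq were copied
       * to child->se.parent,cfs_rq. So call __set_task_cpu() to make
       * those of the child point to valid ones.
       */
      rcu_read_lock();
      __set_task_cpu(p, this_cpu);
      rcu_read_unlock();
      /* ... */
  }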

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer did funny I found that the
rq->cpu_load[] tables were completely screwy.. a bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
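
A sketch of the resulting split, per the commit (catch-up only from the
nohz path, plain single-tick update from the tick):

  /*
   * Called from nohz_idle_balance(): the cpu is idle, so account the
   * jiffies we missed with zero load.
   */
  void update_idle_cpu_load(struct rq *this_rq)
  {
      unsigned long curr_jiffies = jiffies;
      unsigned long load = this_rq->load.weight;
      unsigned long pending_updates;

      /* bail if there's load or we're actually up-to-date */
      if (load || curr_jiffies == this_rq->last_load_update_tick)
          return;

      pending_updates = curr_jiffies - this_rq->last_load_update_tick;
      this_rq->last_load_update_tick = curr_jiffies;
      __update_cpu_load(this_rq, load, pending_updates);
  }

  /* Called from scheduler_tick(): always exactly one update */
  void update_cpu_load_active(struct rq *this_rq)
  {
      this_rq->last_load_update_tick = jiffies;
      __update_cpu_load(this_rq, this_rq->load.weight, 1);
  }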

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
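
A sketch of the nohz-exit callback this change adds:

  void update_cpu_load_nohz(void)
  {
      struct rq *this_rq = this_rq();
      unsigned long curr_jiffies = jiffies;
      unsigned long pending_updates;

      if (curr_jiffies == this_rq->last_load_update_tick)
          return;

      raw_spin_lock(&this_rq->lock);
      pending_updates = curr_jiffies - this_rq->last_load_update_tick;
      if (pending_updates) {
          this_rq->last_load_update_tick = curr_jiffies;
          /*
           * We were idle: this means load 0. The current load might
           * be !0 due to remote wakeups and the sort.
           */
          __update_cpu_load(this_rq, 0, pending_updates);
      }
      raw_spin_unlock(&this_rq->lock);
  }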

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online;
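
The allocation change itself, sketched:

  static int init_rootdomain(struct root_domain *rd)
  {
      memset(rd, 0, sizeof(*rd));

      /* zalloc_cpumask_var() hands back a zeroed mask */
      if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
          goto out;
      if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
          goto free_span;
      if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
          goto free_online;
      /* ... */
  }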

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback to ksoftirqd condition to:
- A time limit of 2 ms,
- need_resched() being set on current task;

When one of these conditions is met, we wake up ksoftirqd for further softirq
processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.
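
A sketch of the resulting exit condition in __do_softirq():

  #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

  asmlinkage void __do_softirq(void)
  {
      unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
      /* ... */
  restart:
      /* run the pending softirq actions */
      /* ... */
      pending = local_softirq_pending();
      if (pending) {
          if (time_before(jiffies, end) && !need_resched())
              goto restart;

          wakeup_softirqd();
      }
      /* ... */
  }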

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following bench, 200 antagonist "netperf -t TCP_RR" are started in
background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq and stop
processing irqs after 10 restarts.

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.
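
With the counter re-added, the exit condition becomes (sketch):

  #define MAX_SOFTIRQ_RESTART 10

      /* in __do_softirq(): */
      int max_restart = MAX_SOFTIRQ_RESTART;
      /* ... */
      pending = local_softirq_pending();
      if (pending) {
          if (time_before(jiffies, end) && !need_resched() &&
              --max_restart)
              goto restart;

          wakeup_softirqd();
      }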

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
alesaiko pushed a commit that referenced this issue Mar 11, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads however to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.
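
A sketch of how the wakeup path could consult the flag; the helper name
below is illustrative:

  static inline bool wake_to_idle(struct task_struct *p)
  {
      /* honour the flag on the waker or on the wakee */
      return (current->flags & PF_WAKE_UP_IDLE) ||
             (p->flags & PF_WAKE_UP_IDLE);
  }

      /* in the select_idle_sibling()-style code: prefer an idle CPU
       * whenever wake_to_idle(p) is true */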

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.
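
A sketch of the sysctl plumbing, following the usual kernel.* pattern
(the exact table placement is an assumption):

  unsigned int sysctl_sched_wake_to_idle;

  /* in kernel/sysctl.c's kern_table[]: */
  {
      .procname     = "sched_wake_to_idle",
      .data         = &sysctl_sched_wake_to_idle,
      .maxlen       = sizeof(unsigned int),
      .mode         = 0644,
      .proc_handler = proc_dointvec,
  },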

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However, when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would not trigger the nohz idle balancer on
that cpu for an unnecessarily long time (introducing additional scheduling
latencies for tasks). Fix the bug in the scheduler which aims to reset
next_balance before a cpu goes idle (as per existing comment) but is
clearly not doing so.
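
The intent, sketched against the idle_balance() tail of this era (the
existing condition is the bug being fixed):

      /*
       * next_balance accumulated the earliest balance point over the
       * domains above, without busy_factor. Going idle must adopt it
       * unconditionally, not only when a task was pulled.
       */
      if (time_after(this_rq->next_balance, next_balance))
          this_rq->next_balance = next_balance;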

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total SCHED_FAIR tasks a cpu has across its various cfs_rqs,
rather than on the number of tasks in a particular cfs_rq (as
implemented currently). As a result, with the current code, it's possible for
a running task (which is the sole task in its cfs_rq) to be preempted
much after its ideal_runtime has elapsed, resulting in increased latency
for tasks in other cfs_rq on same cpu.

Fix this by alarming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark io_schedule_timeout() with EXPORT_SYMBOL

Make io_schedule_timeout() visible to modules.
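
The change itself is presumably just the export macro after the function
definition in kernel/sched/core.c:

  EXPORT_SYMBOL(io_schedule_timeout);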

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in sched_slice() calculation by using wrong
cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather
than the correct one to which they belonged.

Fix the bug by using correct cfs_rq for tasks.
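
Combining this fix with the earlier SCHED_HRTICK change gives roughly the
following hrtick_start_fair() (a sketch):

  static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
  {
      struct sched_entity *se = &p->se;
      struct cfs_rq *cfs_rq = cfs_rq_of(se);    /* the task's own cfs_rq */

      /* arm based on all fair tasks on this cpu, not one cfs_rq */
      if (rq->cfs.h_nr_running > 1) {
          u64 slice = sched_slice(cfs_rq, se);
          u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
          s64 delta = slice - ran;

          if (delta < 0) {
              if (rq->curr == p)
                  resched_task(p);
              return;
          }
          hrtick_start(rq, delta);
      }
  }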

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
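
The fix reads the racy fields once, up front; sketched from the referenced
upstream commit:

  static unsigned long scale_rt_power(int cpu)
  {
      struct rq *rq = cpu_rq(cpu);
      u64 total, available, age_stamp, avg;

      /*
       * Since we're reading these variables without serialization,
       * make sure we read them once and do it up front.
       */
      age_stamp = ACCESS_ONCE(rq->age_stamp);
      avg = ACCESS_ONCE(rq->rt_avg);

      total = sched_avg_period() + (rq->clock - age_stamp);

      if (unlikely(total < avg)) {
          /* Ensures that power won't end up being negative */
          available = 0;
      } else {
          available = total - avg;
      }
      /* ... */
  }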

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it go idle. However that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it is
attached to a new sched domain hierarchy.  This should let cpus run
load balance checks at the right time we expect them to!
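
A minimal sketch of the intent; where exactly the reset lands is an
assumption, but it belongs where the rq joins its new domain hierarchy:

      /* cpu just attached to a new sched domain hierarchy: force an
       * early balance check instead of inheriting a next_balance
       * computed while detached. */
      rq->next_balance = jiffies;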

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
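
In __disable_runtime() the change amounts to clearing the throttle along
with the borrow logic; a sketch of the tail of that function:

  balanced:
          /*
           * Disable all the borrow logic by pretending we have inf
           * runtime - in which case borrowing doesn't make sense.
           */
          rt_rq->rt_runtime = RUNTIME_INF;
          rt_rq->rt_throttled = 0;   /* let migrate_tasks() see the tasks */
          raw_spin_unlock(&rt_rq->rt_runtime_lock);
          raw_spin_unlock(&rt_b->rt_runtime_lock);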

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A onto CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2 thanks to another
part of the kernel to come back and clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
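
A sketch of the resulting check, following the upstream change:

  static inline bool got_nohz_idle_kick(void)
  {
      int cpu = smp_processor_id();

      if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
          return false;

      if (idle_cpu(cpu) && !need_resched())
          return true;

      /*
       * We can't run the Idle Load Balance on this CPU this time, so
       * cancel it and clear NOHZ_BALANCE_KICK.
       */
      clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
      return false;
  }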

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked.  There's no reason to trigger a BUG while holding the rq
lock, crashing the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
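
After the conversion the function bails instead of crashing; a sketch:

  static void try_to_wake_up_local(struct task_struct *p)
  {
      struct rq *rq = task_rq(p);

      if (WARN_ON_ONCE(rq != this_rq()) ||
          WARN_ON_ONCE(p == current))
          return;
      /* ... */
  }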

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because there was an earlier instance that occurred long ago,
change to WARN_ON(). If there ever is a flood of these there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.
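
So the guard becomes something like this (message text illustrative):

      /* no console wakeup while rq->lock is held */
      if (rq != this_rq() || p == current) {
          printk_sched("%s: rq/current mismatch waking %s/%d\n",
                       __func__, p->comm, task_pid_nr(p));
          return;
      }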

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.
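
The change itself is a one-liner in do_balance_runtime(); sketched:

  static int do_balance_runtime(struct rt_rq *rt_rq)
  {
      struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
      /* was: struct root_domain *rd = cpu_rq(smp_processor_id())->rd; */
      struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
      /* ... */
  }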

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit the
BUG_ON at line 689 of rt.c when doing cpu hotplug while there are realtime
threads running, because runtime is enabled twice while the rt_runtime
may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.
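
A sketch of the notifier after the fix; the CPU_STARTING case is what the
change adds:

  static int sched_cpu_active(struct notifier_block *nfb,
                              unsigned long action, void *hcpu)
  {
      switch (action & ~CPU_TASKS_FROZEN) {
      case CPU_STARTING:   /* set the cpu active while it comes online */
      case CPU_DOWN_FAILED:
          set_cpu_active((long)hcpu, true);
          return NOTIFY_OK;
      default:
          return NOTIFY_DONE;
      }
  }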

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()
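
A minimal sketch of the intent, assuming the try_to_wake_up() structure of
this era; the point is to sample the task's cpu only once it is seen off
its old cpu:

      /*
       * Wait until p is done running on its old cpu; only then is
       * task_cpu(p) stable, since we hold p->pi_lock and p is not
       * runnable, so the load balancer can no longer move it.
       */
      while (p->on_cpu)
          cpu_relax();
      smp_rmb();
      src_cpu = task_cpu(p);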

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.
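
Sketch of the reordering described above (task_notify_on_migrate() is the
msm helper named in the text; the notifier call itself is elided):

      raw_spin_lock_irqsave(&p->pi_lock, flags);
      /* ... */
      /* run the test before the task can exit under us */
      notify = task_notify_on_migrate(p);
      raw_spin_unlock_irqrestore(&p->pi_lock, flags);

      if (notify)
          ;    /* fire the cpufreq governor notifier outside the lock */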

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm ple handler
 upon seeing source runqueue length = 1, but it had to export rq length).
 Peter came up with the elegant idea of returning -ESRCH in the scheduler core.
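
The bail-out in yield_to(), sketched from the upstream change:

      double_rq_lock(rq, p_rq);
      /* ... */
      /* both runqueues hold only a single runnable task each */
      if (unlikely(rq->nr_running == 1 && p_rq->nr_running == 1)) {
          yielded = -ESRCH;
          goto out_unlock;
      }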

Signed-off-by: Peter Zijlstra <[email protected]>
Raghavendra: checking the rq length of the target vcpu added (thanks Avi).
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs
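
The fix, sketched at the top of select_idle_sibling():

  static int select_idle_sibling(struct task_struct *p, int target)
  {
      int i = task_cpu(p);

      if (idle_cpu(target))
          return target;

      /*
       * If the previous cpu is cache affine and idle, don't be stupid.
       */
      if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
          return i;

      /* ...otherwise scan the sd_llc domain for an idle cpu... */
  }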

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.
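
A sketch of switched_from_fair() after the change:

  static void switched_from_fair(struct rq *rq, struct task_struct *p)
  {
      struct sched_entity *se = &p->se;
      struct cfs_rq *cfs_rq = cfs_rq_of(se);

      /*
       * dequeue_entity(.flags=0) already normalized vruntime for a
       * task that was on_rq; only a sleeping task still needs it.
       * Checking p->on_rq (not se->on_rq) avoids normalizing twice.
       */
      if (!p->on_rq && p->state != TASK_RUNNING) {
          place_entity(cfs_rq, se, 0);
          se->vruntime -= cfs_rq->min_vruntime;
      }
  }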

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to trigger
a divide by zero on the do_div(temp, (__force u32) total) line, if total is a
non-zero number but has its lower 32 bits zeroed. Removing the casting is not
a good solution since some do_div() implementations do cast to u32
internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
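
The fix replaces the truncating do_div() with a width-aware division;
sketched from the upstream change:

  static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                               cputime_t total)
  {
      u64 temp = (__force u64) rtime;

      temp *= (__force u64) utime;

      if (sizeof(cputime_t) == 4)
          temp = div_u64(temp, (__force u32) total);
      else
          /* keep all 64 divisor bits; do_div() truncated them to u32 */
          temp = div64_u64(temp, (__force u64) total);

      return (__force cputime_t) temp;
  }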

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
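
A sketch of the 32-bit readout after the fix:

  static u64 sched_clock_remote(struct sched_clock_data *scd)
  {
      struct sched_clock_data *my_scd = this_scd();
      u64 this_clock, remote_clock;

  #if BITS_PER_LONG != 64
      /*
       * On 32bit the only way to read a remote 64bit scd->clock
       * atomically is cmpxchg64(); used with identical old/new
       * values it only reads.
       */
      this_clock = sched_clock_local(my_scd);
      remote_clock = cmpxchg64(&scd->clock, 0, 0);
  #else
      /* on 64bit the plain reads are atomic versus the update */
      sched_clock_local(my_scd);
      this_clock = my_scd->clock;
      remote_clock = scd->clock;
  #endif
      /* ... compare and publish the newer value ... */
  }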

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
earlty use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_ra(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu via an IPI. Once the target cpu responds to
the IPI, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick by remembering the jiffies value at which the timeout was
last updated. As long as the RT task's stored jiffies differs from the
global jiffies value, its timeout is allowed to be updated.
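
Concretely, the watchdog stamps the task with the jiffies value at which
its timeout was last bumped; a sketch of the watchdog() change:

  static void watchdog(struct rq *rq, struct task_struct *p)
  {
      unsigned long soft, hard;

      soft = task_rlimit(p, RLIMIT_RTTIME);
      hard = task_rlimit_max(p, RLIMIT_RTTIME);

      if (soft != RLIM_INFINITY) {
          /* bump the timeout at most once per jiffy */
          if (p->rt.watchdog_stamp != jiffies) {
              p->rt.timeout++;
              p->rt.watchdog_stamp = jiffies;
          }
          /* SIGXCPU / SIGKILL checks follow as before */
      }
  }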

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevent futile attempts to move rt tasks
to a CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.
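
The added check in pull_rt_task() looks roughly like this (sketch):

      /*
       * Don't bother taking the src_rq->lock if the next highest
       * task there is known to be of equal or lower priority than
       * our current task.
       */
      if (src_rq->rt.highest_prio.next >= this_rq->rt.highest_prio.curr)
          continue;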

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.
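
A sketch of the reset in rt_mutex_setprio():

      if (rt_prio(prio))
          p->sched_class = &rt_sched_class;
      else {
          /* leaving the RT class: forget boosted RT runtime */
          if (rt_prio(oldprio))
              p->rt.timeout = 0;
          p->sched_class = &fair_sched_class;
      }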

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
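
In __sched_setscheduler() this amounts to picking the enqueue flag from
the old and new priorities; a sketch:

      if (running)
          p->sched_class->set_curr_task(rq);
      if (on_rq)
          /*
           * Enqueue to tail only when the (user space) priority of
           * the task increased; to head when it dropped.
           */
          enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);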

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition: it
relies on the try_to_wake_up/schedule interaction and thus does not need
an mb() between __set_current_state() and the if (signal_pending) check.

However, this __set_current_state() can move into the critical section
protected by rq->lock; now that try_to_wake_up() takes another lock we
need to ensure that it can't be reordered with the
"if (signal_pending(current))" check inside that section.

The patch is actually one-liner, it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
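
The resulting code in __schedule() (sketch):

      /*
       * Make sure that signal_pending_state()->signal_pending() below
       * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
       * done by the caller to avoid the race with signal_wake_up().
       */
      smp_mb__before_spinlock();
      raw_spin_lock_irq(&rq->lock);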

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then, in idle_balance(), it checks whether a task should
be pulled into the current runqueue, assuming it will otherwise go idle.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go to idle but
we have filled the idle_stamp, thinking we will.

This patch closes the window by checking if the runqueue has been modified
but without pulling a task after taking the lock again, so we won't go to idle
right after in the __schedule() function.
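
A sketch of the recheck after re-taking the lock in idle_balance():

      raw_spin_lock(&this_rq->lock);

      /*
       * While browsing the domains we released the rq lock; a task
       * could have been enqueued in the meantime. Since we're not
       * going idle, pretend we pulled a task.
       */
      if (this_rq->cfs.h_nr_running && !pulled_task)
          pulled_task = 1;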

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib, please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
only such loop and it's using possible, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.
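
A sketch of the hotplug notifier after the move:

      case CPU_DEAD:
          /* by now rq->nr_running == 0, so the fold is accurate */
          calc_load_migrate(rq);
          break;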

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments; however, in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
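
With the typo corrected, get_rr_interval_fair() computes the slice
against the task's own cfs_rq (sketch):

      /* time slice is 0 for SCHED_OTHER tasks on an idle runqueue */
      if (rq->cfs.load.weight)
          rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));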

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.
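
The extra pointer is cached in the task itself and only written under the
scheduler locks; a sketch of the shape of the fix:

  struct task_struct {
      ...
  #ifdef CONFIG_CGROUP_SCHED
      /* updated only under pi_lock or rq->lock, in set_task_rq() */
      struct task_group *sched_task_group;
  #endif
      ...
  };

  static inline struct task_group *task_group(struct task_struct *p)
  {
      return p->sched_task_group;
  }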

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (i.e., restore the cpusets to exactly what they were before suspend,
by not touching them at all) but also speed up suspend/resume, since we
avoid running the cpuset update code for every CPU being offlined/onlined.
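
A sketch of the shape of the resulting hotplug callback (the switch arms
are abridged; the FROZEN actions identify the suspend/resume path):

      switch (action) {
      case CPU_ONLINE_FROZEN:            /* suspend/resume path */
          /* leave cpusets alone; run one big domain meanwhile */
          partition_sched_domains(1, NULL, NULL);
          break;
      case CPU_ONLINE:                   /* regular hotplug */
          cpuset_update_active_cpus(true);
          break;
      }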

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it is the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
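
The fix makes task_fork_fair() refresh the child's cpu (and with it
se.parent/cfs_rq, via __set_task_cpu() -> set_task_rq()) unconditionally
under the rq lock; a sketch:

      raw_spin_lock_irqsave(&rq->lock, flags);

      update_rq_clock(rq);

      /* always re-set the cpu so se.parent/cfs_rq are refreshed */
      rcu_read_lock();
      __set_task_cpu(p, this_cpu);
      rcu_read_unlock();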

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer did funny I found that the
rq->cpu_load[] tables were completely screwy.. a bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
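
The split ends up as two entry points; a sketch:

  /* nohz path: may have missed several (idle) ticks, catch up */
  void update_idle_cpu_load(struct rq *this_rq)
  {
      unsigned long curr_jiffies = jiffies;
      unsigned long pending_updates;

      if (curr_jiffies == this_rq->last_load_update_tick)
          return;

      pending_updates = curr_jiffies - this_rq->last_load_update_tick;
      this_rq->last_load_update_tick = curr_jiffies;
      __update_cpu_load(this_rq, this_rq->load.weight, pending_updates);
  }

  /* tick path: runs on the local cpu, exactly one pending update */
  void update_cpu_load_active(struct rq *this_rq)
  {
      this_rq->last_load_update_tick = jiffies;
      __update_cpu_load(this_rq, this_rq->load.weight, 1);
  }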

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
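
The nohz-exit callback then looks like this (sketch):

  void update_cpu_load_nohz(void)
  {
      struct rq *this_rq = this_rq();
      unsigned long curr_jiffies = jiffies;
      unsigned long pending_updates;

      if (curr_jiffies == this_rq->last_load_update_tick)
          return;

      raw_spin_lock(&this_rq->lock);
      pending_updates = curr_jiffies - this_rq->last_load_update_tick;
      if (pending_updates) {
          this_rq->last_load_update_tick = curr_jiffies;
          /* we were idle: the missed ticks carried zero load */
          __update_cpu_load(this_rq, 0, pending_updates);
      }
      raw_spin_unlock(&this_rq->lock);
  }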

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into
another root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online;
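
A sketch of init_rootdomain() after the change:

      if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
          goto out;
      if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
          goto free_span;
      if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
          goto free_online;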

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:
- A time limit of 2 ms, or
- need_resched() being set on the current task.

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.
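
The loop control in __do_softirq() becomes (sketch):

  #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

      unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
      ...
  restart:
      /* dispatch pending softirq actions here */

      pending = local_softirq_pending();
      if (pending) {
          if (time_before(jiffies, end) && !need_resched())
              goto restart;
          wakeup_softirqd();
      }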

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following bench, 200 antagonist "netperf -t TCP_RR" instances are
started in the background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq() and
stop processing irqs after 10 restarts.
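
With the counter back, the bail-out no longer depends on jiffies alone
(sketch):

  #define MAX_SOFTIRQ_RESTART 10

      int max_restart = MAX_SOFTIRQ_RESTART;
      ...
      pending = local_softirq_pending();
      if (pending) {
          if (time_before(jiffies, end) && !need_resched() &&
              --max_restart)
              goto restart;
          wakeup_softirqd();
      }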

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

alesaiko pushed a commit that referenced this issue Mar 12, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads however to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.
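
A rough sketch of how the flag gates wakeup placement; only the flag name
comes from this message, the hook point and the helper are assumptions:

      /* hypothetical gate in the wakeup path (select_task_rq_fair()) */
      if ((p->flags & PF_WAKE_UP_IDLE) ||
          (current->flags & PF_WAKE_UP_IDLE)) {
          int i = find_idle_cpu_for(p);    /* hypothetical helper */

          if (i >= 0)
              return i;
      }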

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.
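
The sysctl side is standard ctl_table plumbing; a sketch (the placement in
kernel/sysctl.c is an assumption):

  unsigned int sysctl_sched_wake_to_idle;

      {
          .procname     = "sched_wake_to_idle",
          .data         = &sysctl_sched_wake_to_idle,
          .maxlen       = sizeof(unsigned int),
          .mode         = 0644,
          .proc_handler = proc_dointvec,
      },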

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failing to do so keeps the nohz idle balancer from being
triggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
which aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.
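
The gist of the fix in idle_balance(), as a sketch (surrounding code
elided):

      /*
       * We are going idle; next_balance may have been computed with
       * busy_factor applied, so reset it from the domain traversal.
       */
      if (!pulled_task)
          this_rq->next_balance = next_balance;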

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

The SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total number of SCHED_FAIR tasks a cpu has across its
various cfs_rqs, rather than on the number of tasks in a particular
cfs_rq (as implemented currently). As a result, with the current code,
it's possible for a running task (which is the sole task in its cfs_rq)
to be preempted much after its ideal_runtime has elapsed, resulting in
increased latency for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.
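
A sketch of hrtick_start_fair() after this change, with the later
follow-up fix ('sched: fix reference to wrong cfs_rq', below) already
folded in:

      if (rq->cfs.h_nr_running > 1) {
          u64 slice = sched_slice(cfs_rq_of(se), se);
          u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
          s64 delta = slice - ran;

          if (delta < 0) {
              if (rq->curr == p)
                  resched_task(p);
              return;
          }
          hrtick_start(rq, delta);
      }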

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark schedule_io_timeout() with EXPORT_SYMBOL

Make schedule_io_timeout() visible to modules.
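
The change itself is just the export, using the name from this message
(sketch):

  EXPORT_SYMBOL(schedule_io_timeout);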

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in the sched_slice() calculation by using the
wrong cfs_rq for tasks: rq->cfs was incorrectly used as a task's cfs_rq,
rather than the correct one to which the task belonged.

Fix the bug by using correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
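
A sketch of the hardened readout in scale_rt_power():

      u64 total, available, age_stamp, avg;

      /* snapshot once; never re-read after the comparison */
      age_stamp = ACCESS_ONCE(rq->age_stamp);
      avg = ACCESS_ONCE(rq->rt_avg);

      total = sched_avg_period() + (rq->clock - age_stamp);

      if (unlikely(total < avg)) {
          /* ensures that 'available' won't end up negative */
          available = 0;
      } else {
          available = total - avg;
      }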

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it go idle. However that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it
attaches to a new sched domain hierarchy. This should let cpus run load
balance checks at the time we expect them to!
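
The gist, as a sketch (the exact hook that attaches a cpu to its new
domain hierarchy is an assumption based on this message):

      /*
       * Attached to a new hierarchy: don't carry over a stale,
       * far-future balance stamp computed while busy (or the 1 sec
       * default used while no domains were attached).
       */
      rq->next_balance = jiffies;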

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.
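
A sketch of the tail of __disable_runtime() after the change:

      /*
       * Disable all the borrow logic by pretending we have inf
       * runtime - in which case borrowing doesn't make sense.
       */
      rt_rq->rt_runtime = RUNTIME_INF;
      rt_rq->rt_throttled = 0;
      raw_spin_unlock(&rt_rq->rt_runtime_lock);
      raw_spin_unlock(&rt_b->rt_runtime_lock);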

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is subsequently selected as the ILB, no reschedule IPI is
sent because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance is
performed.

We must then wait for the sched softirq to be raised on CPU 2 by some
other part of the kernel in order to come back and clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
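
A sketch of the resulting helper:

  static inline bool got_nohz_idle_kick(void)
  {
      int cpu = smp_processor_id();

      if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
          return false;

      if (idle_cpu(cpu) && !need_resched())
          return true;

      /*
       * We can't run the Idle Load Balance on this CPU this time, so
       * cancel it and clear NOHZ_BALANCE_KICK.
       */
      clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
      return false;
  }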

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. A missed try_to_wake_up_local() can stall workqueue
execution, but such stalls are likely to be finite, either because
another work item gets queued or because the one blocked gets
unblocked. There's no reason to trigger a BUG while holding the rq
lock and crash the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit, it indicates
something is wrong, and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because an earlier instance occurred long ago,
change to WARN_ON(). If there ever is a flood of these, there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.
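
Taken together, the guard in try_to_wake_up_local() ends up looking
roughly like this (the exact message text is an assumption):

      if (rq != this_rq() || p == current) {
          printk_sched("%s: Failed to wake up task %d, rq = %p, this_rq = %p, p = %p, current = %p\n",
                       __func__, task_pid_nr(p), rq, this_rq(), p, current);
          return;
      }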

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.
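
The substance of the patch is one line in do_balance_runtime() (sketch):

      struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
      /* use the rd of the rq this rt_rq belongs to, not the caller's */
      struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;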

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try to reclaim runtime lent to another rt_rq but the runtime
has been lent to a rt_rq in another rd.

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit the
BUG_ON at line 689 of rt.c when doing cpu hotplug while there are realtime
threads running, because runtime is enabled twice while the rt_runtime
may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.
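
Upstream this became: mark the CPU active in the same place it is marked
online (sketch):

  void set_cpu_online(unsigned int cpu, bool online)
  {
      if (online) {
          cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
          cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
      } else {
          cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
      }
  }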

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task, and task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
both source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
 seeing source runqueue length = 1, but it had to export rq length).
 Peter came up with the elegant idea of return -ESRCH in scheduler core.
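
A sketch of the bail-out in yield_to():

      /*
       * If we're the only runnable task on the rq and target rq also
       * has only one task, there's absolutely no point in yielding.
       */
      if (rq->nr_running == 1 && p_rq->nr_running == 1) {
          yielded = -ESRCH;
          goto out_irq;
      }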

Signed-off-by: Peter Zijlstra <[email protected]>
[Raghavendra: added the check on the target vcpu's rq length. (thanks, Avi)]
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs
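
A sketch of the shortcut in select_idle_sibling():

      int i = task_cpu(p);    /* the task's previous cpu */

      if (idle_cpu(target))
          return target;

      /*
       * If the previous cpu is cache affine and idle, select it.
       */
      if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
          return i;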

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0,
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule(), the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
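
The substance of the change in switched_from_fair() (sketch):

      /*
       * Test p->on_rq, not se->on_rq: dequeue_entity() has already
       * cleared se->on_rq, so testing it would renormalize twice.
       */
      if (!p->on_rq && p->state != TASK_RUNNING) {
          place_entity(cfs_rq, se, 0);
          se->vruntime -= cfs_rq->min_vruntime;
      }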

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line
if total is a non-zero number whose lower 32 bits are zeroed. Removing
the cast is not a good solution since some do_div() implementations
do cast to u32 internally.
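
The fix routes 64-bit totals through a full 64-bit division; a sketch of
the scaling helper:

  static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                               cputime_t total)
  {
      u64 temp = (__force u64) rtime;

      temp *= (__force u64) utime;

      if (sizeof(cputime_t) == 4)
          temp = div_u64(temp, (__force u32) total);
      else
          temp = div64_u64(temp, (__force u64) total);

      return (__force cputime_t) temp;
  }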

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock on
CPU0 ~4 seconds ahead of its own and update CPU0's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1's sched_clock happens in the context of a realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
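
A sketch of the 32-bit readout in sched_clock_remote():

  #if BITS_PER_LONG != 64
      /*
       * cmpxchg64() with identical old/new is effectively an atomic
       * 64bit read; a plain load could tear between the halves.
       */
      this_clock = sched_clock_local(my_scd);
      remote_clock = cmpxchg64(&scd->clock, 0, 0);
  #else
      this_clock = sched_clock_local(my_scd);
      remote_clock = scd->clock;
  #endif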

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all of the
rq's RT tasks (no other task has the same priority) is waking on a
throttled rt_rq. The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.
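
A sketch of the guard added to inc_rt_prio_smp() (dec_rt_prio_smp() gets
the same treatment):

  static void inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
  {
      struct rq *rq = rq_of_rt_rq(rt_rq);

  #ifdef CONFIG_RT_GROUP_SCHED
      /* change rq's cpupri only if rt_rq is the top queue */
      if (&rq->rt != rt_rq)
          return;
  #endif
      if (rq->online && prio < prev_prio)
          cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
  }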

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
earlty use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_ra(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick by remembering the jiffies value at which the timeout
was last updated. As long as the RT task's recorded jiffies value differs
from the global jiffies value, we allow its timeout to be updated.
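
A minimal sketch of that guard, following the shape of the mainline fix
(the stamp lives in p->rt.watchdog_stamp):

    static void watchdog(struct rq *rq, struct task_struct *p)
    {
            unsigned long soft;

            /* RLIMIT_RTTIME soft limit, in microseconds */
            soft = task_rlimit(p, RLIMIT_RTTIME);
            if (soft != RLIM_INFINITY) {
                    /* bump the timeout at most once per jiffy */
                    if (p->rt.watchdog_stamp != jiffies) {
                            p->rt.timeout++;
                            p->rt.watchdog_stamp = jiffies;
                    }
                    ...
            }
    }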

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevent futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.
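
The fix itself is essentially a one-line reset in rt_mutex_setprio()
when the task drops back below the RT classes (a sketch based on the
mainline patch):

    /* in rt_mutex_setprio(), when the boosted priority is removed: */
    if (rt_prio(prio)) {
            p->sched_class = &rt_sched_class;
    } else {
            /* leaving RT: restart the RLIMIT_RTTIME accounting */
            if (rt_prio(oldprio))
                    p->rt.timeout = 0;
            p->sched_class = &fair_sched_class;
    }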

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
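
A sketch of the enqueue in __sched_setscheduler() after the change
(a numerically larger p->prio means a lower priority):

    if (queued) {
            /*
             * Queue the task to the head of its prio list when the
             * priority was lowered, so it keeps the CPU as POSIX
             * requires; enqueue to the tail when it was raised.
             */
            enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);
    }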

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock, now that try_to_wake_up() takes another lock we
need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
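
A sketch of the helper's default definition and its use in __schedule():

    /*
     * Default: a wmb paired with the following spin_lock() is
     * sufficient; architectures may override this.
     */
    #ifndef smp_mb__before_spinlock
    #define smp_mb__before_spinlock()       smp_wmb()
    #endif

    /* in __schedule(): */
    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);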

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then it checks if a task should be pulled in the current
runqueue in idle_balance() assuming it will go to idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go idle even
though we have already filled in idle_stamp, thinking we would.

This patch closes the window by checking if the runqueue has been modified
but without pulling a task after taking the lock again, so we won't go to idle
right after in the __schedule() function.
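
A sketch of the recheck after re-taking the lock in idle_balance():

    raw_spin_lock(&this_rq->lock);

    /*
     * While browsing the domains we released the rq lock; a task
     * could have been enqueued in the meantime.
     */
    if (this_rq->cfs.h_nr_running && !pulled_task)
            pulled_task = 1;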

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong, in
that he sees artifacts due to this (Rakib, please expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and it's already using possible, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
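
The resulting fold is a small helper run once all tasks have been
migrated away (a sketch of the merged version):

    static void calc_load_migrate(struct rq *rq)
    {
            long delta = calc_load_fold_active(rq);

            /* fold any leftover nr_active delta into the global count */
            if (delta)
                    atomic_long_add(delta, &calc_load_tasks);
    }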

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
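
With the typo fixed, the interval is computed against the task's own
cfs_rq (a sketch of get_rr_interval_fair()):

    static unsigned int
    get_rr_interval_fair(struct rq *rq, struct task_struct *task)
    {
            struct sched_entity *se = &task->se;
            unsigned int rr_interval = 0;

            /* time slice is 0 for SCHED_OTHER tasks on an idle rq */
            if (rq->cfs.load.weight)
                    rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));

            return rr_interval;
    }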

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.
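
A sketch of the extra pointer and its accessor, as merged upstream:

    struct task_struct {
            ...
    #ifdef CONFIG_CGROUP_SCHED
            /* updated only under pi_lock and/or rq->lock */
            struct task_group *sched_task_group;
    #endif
            ...
    };

    static inline struct task_group *task_group(struct task_struct *p)
    {
            return p->sched_task_group;
    }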

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this not only solves the cpuset problem related to
suspend/resume (i.e., it restores the cpusets to exactly what they were
before suspend, by not touching them at all) but also speeds up
suspend/resume, because we avoid running the cpuset update code for every
CPU being offlined/onlined.

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to a "use-after-free" and cause a
panic, because it is the new cgroup's refcount that is incremented at
(*1), so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer did funny things, I found that
the rq->cpu_load[] tables were completely screwy; a bit more digging
revealed that the updates that got through were missing ticks, followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
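
A sketch of the nohz-exit callback that ages the array:

    void update_cpu_load_nohz(void)
    {
            struct rq *this_rq = this_rq();
            unsigned long curr_jiffies = jiffies;
            unsigned long pending_updates;

            if (curr_jiffies == this_rq->last_load_update_tick)
                    return;

            raw_spin_lock(&this_rq->lock);
            pending_updates = curr_jiffies - this_rq->last_load_update_tick;
            if (pending_updates) {
                    this_rq->last_load_update_tick = curr_jiffies;
                    /* the cpu was idle: account the missed ticks as zero load */
                    __update_cpu_load(this_rq, 0, pending_updates);
            }
            raw_spin_unlock(&this_rq->lock);
    }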

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into
another root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online.
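
A sketch of init_rootdomain() after the change:

    static int init_rootdomain(struct root_domain *rd)
    {
            memset(rd, 0, sizeof(*rd));

            /* zalloc_cpumask_var() hands back zeroed masks */
            if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
                    goto out;
            if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
                    goto free_span;
            if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
                    goto free_online;
            ...
    }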

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:
- a time limit of 2 ms, or
- need_resched() being set on the current task.

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.
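
A condensed sketch of the resulting dispatcher loop:

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    asmlinkage void __do_softirq(void)
    {
            unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
            ...
    restart:
            /* handle the pending softirq vectors ... */

            pending = local_softirq_pending();
            if (pending) {
                    if (time_before(jiffies, end) && !need_resched())
                            goto restart;

                    wakeup_softirqd();
            }
    }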

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following benchmark, 200 antagonist "netperf -t TCP_RR" instances
are started in the background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq() and
stop processing softirqs after 10 restarts.
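
The counter simply joins the existing bail-out conditions (sketch):

    #define MAX_SOFTIRQ_RESTART 10

    /* in __do_softirq(): */
    int max_restart = MAX_SOFTIRQ_RESTART;
    ...
    pending = local_softirq_pending();
    if (pending) {
            /* stop after 10 restarts even if jiffies never advances */
            if (time_before(jiffies, end) && !need_resched() &&
                --max_restart)
                    goto restart;

            wakeup_softirqd();
    }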

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
alesaiko pushed a commit that referenced this issue Mar 13, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads however to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.
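
A rough sketch of how the two knobs might combine in the wakeup path.
This is illustrative only: PF_WAKE_UP_IDLE and the sched_wake_to_idle
sysctl come from the commit text, but the loop and variable names here
are assumptions, not the actual msm implementation:

    /* illustrative sketch, in select_task_rq_fair() */
    int want_idle = sysctl_sched_wake_to_idle ||
                    (p->flags & PF_WAKE_UP_IDLE) ||
                    (current->flags & PF_WAKE_UP_IDLE);

    if (want_idle) {
            /* prefer any allowed idle cpu in the domain */
            for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
                    if (idle_cpu(i)) {
                            new_cpu = i;
                            break;
                    }
            }
    }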

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However, when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would leave the nohz idle balancer
untriggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
which aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.
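
For reference, the shape of the reset at the tail of idle_balance():

    if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
            /*
             * We are going idle. next_balance may be set based on
             * a busy processor. So reset next_balance.
             */
            this_rq->next_balance = next_balance;
    }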

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

The SCHED_HRTICK feature is useful for preempting SCHED_FAIR tasks
on-the-dot (just when they would have exceeded their ideal_runtime). It
makes use of a per-cpu hrtimer resource, and hence the alarm for that
hrtimer should be based on the total number of SCHED_FAIR tasks a cpu
has across its various cfs_rqs, rather than on the number of tasks in a
particular cfs_rq (as implemented currently). As a result, with the
current code, it's possible for a running task (which is the sole task
in its cfs_rq) to be preempted long after its ideal_runtime has elapsed,
resulting in increased latency for tasks in other cfs_rqs on the same
cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.
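
A sketch of hrtick_start_fair() with the condition widened to the whole
cpu (the change is in the if-condition; the rest is the existing
mainline body):

    static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
    {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /* alarm based on all CFS tasks on this cpu, not one cfs_rq */
            if (rq->cfs.h_nr_running > 1) {
                    u64 slice = sched_slice(cfs_rq, se);
                    u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
                    s64 delta = slice - ran;

                    if (delta < 0) {
                            if (rq->curr == p)
                                    resched_task(p);
                            return;
                    }
                    hrtick_start(rq, delta);
            }
    }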

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark io_schedule_timeout() with EXPORT_SYMBOL

Make io_schedule_timeout() visible to modules.
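
The change is a one-line export next to the function's definition
(assuming the function in question is the standard io_schedule_timeout()
in kernel/sched/core.c):

    long __sched io_schedule_timeout(long timeout)
    {
            ...
    }
    EXPORT_SYMBOL(io_schedule_timeout);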

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in the sched_slice() calculation by using the
wrong cfs_rq for tasks: rq->cfs was incorrectly used as a task's cfs_rq,
rather than the correct one to which the task belonged.

Fix the bug by using the correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative
value.
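
A sketch of scale_rt_power() with the values read exactly once:

    static unsigned long scale_rt_power(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            u64 total, available, age_stamp, avg;

            /* read the shared values once so the check stays valid */
            age_stamp = ACCESS_ONCE(rq->age_stamp);
            avg = ACCESS_ONCE(rq->rt_avg);

            total = sched_avg_period() + (rq->clock - age_stamp);

            if (unlikely(total < avg)) {
                    /* ensure 'available' can never go negative */
                    available = 0;
            } else {
                    available = total - avg;
            }
            ...
    }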

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it to go idle. However, that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it is
attached to a new sched domain hierarchy. This should let cpus run their
load balance checks at the time we expect them to!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally, move unthrottle_offline_cfs_rqs() to rq_offline_fair().
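
A sketch of the relevant tail of __disable_runtime(), following the
upstream patch:

    /* at the end of __disable_runtime(), for each rt_rq of the rq: */
    raw_spin_lock(&rt_rq->rt_runtime_lock);
    rt_rq->rt_runtime = RUNTIME_INF;
    /*
     * Let any throttled queue run again, so migrate_tasks() can
     * pull every runnable task off this rq.
     */
    rt_rq->rt_throttled = 0;
    raw_spin_unlock(&rt_rq->rt_runtime_lock);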

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2 by some other
part of the kernel before NOHZ_BALANCE_KICK gets cleared.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
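
The resulting helper looks roughly like this (a sketch of the mainline
version):

    static inline bool got_nohz_idle_kick(void)
    {
            int cpu = smp_processor_id();

            if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                    return false;

            if (idle_cpu(cpu) && !need_resched())
                    return true;

            /*
             * We can't run the Idle Load Balance on this CPU this
             * time, so cancel it and clear NOHZ_BALANCE_KICK.
             */
            clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
            return false;
    }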

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. A missing try_to_wake_up_local() can stall workqueue
execution, but such stalls are likely to be finite, either by
another work item being queued or by the blocked one getting
unblocked.  There's no reason to trigger a BUG while holding the rq
lock and crash the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit, it indicates
something is wrong, and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because an earlier instance occurred long ago, change them to
WARN_ON(). If there is ever a flood of these, there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead, which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains, do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rqs in
cpu_online_mask.  This means that when balance_runtime() is run for a
given rt_rq, that rt_rq may be in a different rd than the current
processor's.  Thus, if we use smp_processor_id() to get the rd in
do_balance_runtime(), we may borrow runtime from an rt_rq that is
not part of our rd.

This changes do_balance_runtime() to get the rd from the passed-in
rt_rq, ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime, hit
when we try to reclaim runtime lent to other rt_rqs but the runtime
has been lent to an rt_rq in another rd.
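
The change itself is a one-liner in do_balance_runtime() (sketch):

    static int do_balance_runtime(struct rt_rq *rt_rq)
    {
            struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
            /* use the rd of the rt_rq being balanced, not this cpu's */
            struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
            ...
    }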

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() does everything that update_runtime() does,
so let's remove the latter.

Furthermore, there is a potential risk that the current code will hit
the BUG_ON at line 689 of rt.c when doing cpu hotplug while realtime
threads are running, because runtime gets enabled twice while rt_runtime
may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a subset of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, when it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In undercommitted scenarios, especially with large guests, yield_to
overhead is significantly high. When the run queue length of both source
and target is one, take the opportunity to bail out and return -ESRCH.
This return condition can be further exploited to quickly come out of
the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE
 handler upon seeing source runqueue length = 1, but that had to export
 the rq length. Peter came up with the elegant idea of returning -ESRCH
 in the scheduler core.)
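
The bail-out, as it went upstream, is a small check early in yield_to()
(sketch):

    /* in yield_to(), with both runqueues locked: */
    /*
     * If we're the only runnable task on the rq and the target rq
     * also has only one task, there is absolutely no point in
     * yielding.
     */
    if (rq->nr_running == 1 && p_rq->nr_running == 1) {
            yielded = -ESRCH;
            goto out_irq;
    }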

Signed-off-by: Peter Zijlstra <[email protected]>
[Raghavendra: added the check on the target vcpu's rq length (thanks Avi)]
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.
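
A sketch of the resulting check at the top of select_idle_sibling():

    static int select_idle_sibling(struct task_struct *p, int target)
    {
            int i = task_cpu(p);

            if (idle_cpu(target))
                    return target;

            /* if the previous cpu is cache affine and idle, pick it */
            if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
                    return i;
            ...
    }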

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
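
A sketch of the fixed condition in switched_from_fair():

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /*
             * Test p->on_rq, not se->on_rq: dequeue_entity() has
             * already cleared se->on_rq, so testing it would let the
             * vruntime be normalized a second time.
             */
            if (!p->on_rq && p->state != TASK_RUNNING) {
                    place_entity(cfs_rq, se, 0);
                    se->vruntime -= cfs_rq->min_vruntime;
            }
            ...
    }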

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line
if total is a non-zero number but has its lower 32 bits zeroed. Removing
the casting is not a good solution, since some do_div() implementations
do cast to u32 internally.
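
A sketch of the fix, replacing do_div() with a full 64-by-64 division:

    /* in {thread_group,task}_times(): */
    if (total) {
            u64 temp = (__force u64) rtime;

            /* safe even when the lower 32 bits of total are zero */
            temp = div64_u64(temp * (__force u64) utime,
                             (__force u64) total);
            utime = (__force cputime_t) temp;
    } else {
            utime = rtime;
    }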

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is quite a rare problem, because it requires the update to hit the
narrow race window between the low/high readout, and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own, and updates CPU1's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1's sched_clock happens in the context of a realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
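
A condensed sketch of the 32bit path in sched_clock_remote() after the
fix:

    static u64 sched_clock_remote(struct sched_clock_data *scd)
    {
            struct sched_clock_data *my_scd = this_scd();
            u64 this_clock, remote_clock;
    #if BITS_PER_LONG != 64
            /*
             * On 32bit a 64bit load is not atomic: use cmpxchg64()
             * with identical old/new values to read the remote clock
             * atomically. An NMI could otherwise hit between the low
             * and the high 32bit readout.
             */
            this_clock = sched_clock_local(my_scd);
            remote_clock = cmpxchg64(&scd->clock, 0, 0);
    #else
            this_clock = sched_clock_local(my_scd);
            remote_clock = scd->clock;
    #endif
            ...
    }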

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all of the
rq's RT tasks (no other task has the same priority) is waking on a
throttled rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.
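
A sketch of the guard added to inc_rt_prio_smp() (dec_rt_prio_smp()
gets the same treatment):

    static void
    inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
    {
            struct rq *rq = rq_of_rt_rq(rt_rq);

    #ifdef CONFIG_RT_GROUP_SCHED
            /* change the rq's cpupri only for the top-level rt_rq */
            if (&rq->rt != rt_rq)
                    return;
    #endif
            if (rq->online && prio < prev_prio)
                    cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
    }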

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.
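
The fix itself is a single line in init_tg_cfs_entry(), run for every
non-root group entity (sketch):

    /* guarantee group entities always have weight */
    update_load_set(&se->load, NICE_0_LOAD);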

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If the user launches an RT FIFO task with a priority lower
than the softirq RT tasks' 49, it's possible for two RT FIFO tasks to be
enqueued on one cpu runqueue at the same moment. Under the current
strategy of balancing RT tasks, we really want to put them onto a CPU
they can run on as soon as possible: even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu via an IPI. Once the target cpu responds to
the IPI, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick by remembering the jiffies value at which the timeout
was last updated. As long as the RT task's recorded jiffies value differs
from the global jiffies value, we allow its timeout to be updated.

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424
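
For illustration, a minimal sketch of the fix in the 3.x-era watchdog()
in kernel/sched/rt.c, following the upstream patch (the rt.watchdog_stamp
field is the new per-task state; details are illustrative):

    static void watchdog(struct rq *rq, struct task_struct *p)
    {
            unsigned long soft, hard;

            soft = task_rlimit(p, RLIMIT_RTTIME);
            hard = task_rlimit_max(p, RLIMIT_RTTIME);

            if (soft != RLIM_INFINITY) {
                    unsigned long next;

                    /* Only bump the RT timeout once per jiffy */
                    if (p->rt.watchdog_stamp != jiffies) {
                            p->rt.timeout++;
                            p->rt.watchdog_stamp = jiffies;
                    }

                    next = DIV_ROUND_UP(min(soft, hard), USEC_PER_SEC/HZ);
                    if (p->rt.timeout > next)
                            p->cputime_expires.sched_exp = p->se.sum_exec_runtime;
            }
    }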

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevent futile attempts to move RT tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson<[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
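
For illustration, a sketch of the added check in find_lock_lowest_rq()
(kernel/sched/rt.c), modelled on the upstream patch; the retry loop is
abbreviated:

    for (tries = 0; tries < RT_MAX_TRIES; tries++) {
            int cpu = find_lowest_rq(task);

            if ((cpu == -1) || (cpu == rq->cpu))
                    break;

            lowest_rq = cpu_rq(cpu);

            if (lowest_rq->rt.highest_prio.curr <= task->prio) {
                    /*
                     * The target rq has tasks of equal or higher
                     * priority; retrying does not release any lock
                     * and is unlikely to yield a different result.
                     */
                    lowest_rq = NULL;
                    break;
            }

            /* ... otherwise double_lock_balance() and re-check ... */
    }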

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <[email protected]>
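
The fix itself is a two-line change in rt_mutex_setprio()
(kernel/sched/core.c); a sketch of the relevant branch, per the patch:

    if (rt_prio(prio))
            p->sched_class = &rt_sched_class;
    else {
            /* Leaving RT: forget any accumulated RLIMIT_RTTIME ticks */
            if (rt_prio(oldprio))
                    p->rt.timeout = 0;
            p->sched_class = &fair_sched_class;
    }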

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
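
For illustration, a sketch of the tail of __sched_setscheduler() with the
fix applied (3.x-era structure, illustrative): a numerically raised prio
value means the task was deboosted, so it is re-queued at the head of its
new priority list and keeps the CPU, as POSIX requires.

    if (running)
            p->sched_class->set_curr_task(rq);
    if (on_rq)
            enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);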

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock; now that try_to_wake_up() takes another lock, we
need to ensure that it can't be reordered with the
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does, for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
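
A sketch of the resulting helper and its use in __schedule(), per the
patch description; the default implementation is a write barrier, which
architectures may override:

    /* include/linux/spinlock.h */
    #ifndef smp_mb__before_spinlock
    /*
     * Only guarantees that a STORE before the critical section can not
     * be reordered with a LOAD inside it; spin_lock() already keeps the
     * LOAD from escaping, so smp_wmb() suffices by default.
     */
    #define smp_mb__before_spinlock()       smp_wmb()
    #endif

    /* kernel/sched/core.c: __schedule() */
    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);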

sched: Fix race in idle_balance()

The scheduler's main function 'schedule()' checks if there are no more
tasks on the runqueue. Then it checks, via idle_balance(), whether a task
should be pulled into the current runqueue, assuming it will go idle
otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go idle even
though we have filled in idle_stamp, thinking we will.

This patch closes the window by checking, after taking the lock again,
whether the runqueue has been modified without a task having been pulled,
so we won't go idle right after in the __schedule() function.

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
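
A sketch of the re-check added at the end of idle_balance()
(kernel/sched/fair.c), per the patch:

    raw_spin_lock(&this_rq->lock);

    /*
     * While browsing the domains we released the rq lock; a task could
     * have been enqueued in the meantime. Since we're not going idle,
     * pretend we pulled a task so __schedule() doesn't go idle with a
     * freshly filled idle_stamp.
     */
    if (this_rq->cfs.h_nr_running && !pulled_task)
            pulled_task = 1;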

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib, please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops. However, since nr_uninterruptible() is
the only such loop and it is already using possible, let's not bother
at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
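
Taken together, the two patches reduce the hotplug side to a simple fold
of the dead cpu's active count; a sketch, modelled on the upstream code:

    static void calc_load_migrate(struct rq *rq)
    {
            long delta = calc_load_fold_active(rq);

            /* Fold the dead cpu's active delta into the global count */
            if (delta)
                    atomic_long_add(delta, &calc_load_tasks);
    }

    /* migration_call(): run at CPU_DEAD, not CPU_DYING, so that
     * nr_running is known to be 0 (no stopper thread accounted) */
    case CPU_DEAD:
            calc_load_migrate(rq);
            break;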

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
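
The fix is a one-liner in get_rr_interval_fair() (kernel/sched/fair.c);
a sketch:

    static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
    {
            struct sched_entity *se = &task->se;
            unsigned int rr_interval = 0;

            /*
             * Time slice is 0 for SCHED_OTHER tasks that are on an
             * idle runqueue.
             */
            if (rq->cfs.load.weight)
                    /* was: sched_slice(&rq->cfs, se) -- the wrong cfs_rq */
                    rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));

            return rr_interval;
    }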

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
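
A sketch of the result: task_struct grows a sched_task_group pointer that
is only updated under the scheduler locks, so repeated task_group() calls
within one locked region return the same value:

    /* include/linux/sched.h */
    struct task_struct {
            /* ... */
    #ifdef CONFIG_CGROUP_SCHED
            struct task_group *sched_task_group;
    #endif
            /* ... */
    };

    /* kernel/sched/core.c */
    static inline struct task_group *task_group(struct task_struct *p)
    {
            return p->sched_task_group;
    }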

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to
suspend/resume (i.e., restore the cpusets to exactly what they were
before suspend, by not touching them at all) but also speed up
suspend/resume, because we avoid running the cpuset update code for
every CPU being offlined/onlined.

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
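
A sketch of the resume-side hotplug notifier implementing points 2 and 3,
modelled on the upstream patch; num_cpus_frozen counts CPUs taken down in
the suspend path:

    static int cpuset_cpu_active(struct notifier_block *nfb,
                                 unsigned long action, void *hcpu)
    {
            switch (action) {
            case CPU_ONLINE_FROZEN:
                    /*
                     * Not the last online operation of the resume
                     * sequence: keep the single, cpuset-ignoring
                     * sched domain.
                     */
                    num_cpus_frozen--;
                    if (likely(num_cpus_frozen)) {
                            partition_sched_domains(1, NULL, NULL);
                            break;
                    }
                    /*
                     * Last CPU coming back online: fall through and
                     * rebuild the sched domains from the unaltered
                     * cpusets.
                     */
            case CPU_ONLINE:
                    cpuset_update_active_cpus(true);
                    break;
            default:
                    return NOTIFY_DONE;
            }
            return NOTIFY_OK;
    }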

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to a "use-after-free" and cause a
panic, because it's the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
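
A sketch of the fix in task_fork_fair() (kernel/sched/fair.c): re-run
__set_task_cpu() under the rq lock so the child's se.parent and cfs_rq
are re-derived from the current (possibly updated) task group:

    raw_spin_lock_irqsave(&rq->lock, flags);

    update_rq_clock(rq);

    /*
     * Not only the cpu but also the task_group of the parent might
     * have been changed after parent->se.parent,cfs_rq were copied
     * to child->se.parent,cfs_rq. Call __set_task_cpu() so those of
     * the child point to valid ones.
     */
    rcu_read_lock();
    __set_task_cpu(p, this_cpu);
    rcu_read_unlock();

    cfs_rq = task_cfs_rq(current);
    curr = cfs_rq->curr;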

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer was doing funny things I found
that the rq->cpu_load[] tables were completely screwy. A bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle, etc.); this means that
especially the higher indices were significantly lower than they ought
to be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
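
A sketch of the nohz-exit callback added by the patch; pending_updates is
the number of missed ticks, which are aged in as zero load:

    /*
     * Called on nohz exit: decay the cpu_load[] entries for the
     * jiffies the cpu slept through, using zero contributions.
     */
    void update_cpu_load_nohz(void)
    {
            struct rq *this_rq = this_rq();
            unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
            unsigned long pending_updates;

            if (curr_jiffies == this_rq->last_load_update_tick)
                    return;

            raw_spin_lock(&this_rq->lock);
            pending_updates = curr_jiffies - this_rq->last_load_update_tick;
            if (pending_updates) {
                    this_rq->last_load_update_tick = curr_jiffies;
                    /* We were idle: the whole gap contributes 0 load */
                    __update_cpu_load(this_rq, 0, pending_updates);
            }
            raw_spin_unlock(&this_rq->lock);
    }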

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated-domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into
another root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online.

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
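
A sketch of init_rootdomain() with the fix: zalloc_cpumask_var() returns
zeroed masks, so no garbage bits survive in span, online or rto_mask:

    static int init_rootdomain(struct root_domain *rd)
    {
            memset(rd, 0, sizeof(*rd));

            if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
                    goto out;
            if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
                    goto free_span;
            if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
                    goto free_online;

            if (cpupri_init(&rd->cpupri) != 0)
                    goto free_rto_mask;
            return 0;

    free_rto_mask:
            free_cpumask_var(rd->rto_mask);
    free_online:
            free_cpumask_var(rd->online);
    free_span:
            free_cpumask_var(rd->span);
    out:
            return -ENOMEM;
    }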

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:
- a time limit of 2 ms,
- need_resched() being set on the current task.

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following bench, 200 antagonist "netperf -t TCP_RR" instances are
started in the background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
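
A sketch of the reworked loop in __do_softirq() (kernel/softirq.c), per
the patch: the fixed iteration budget is replaced by a 2 ms deadline plus
a need_resched() check; processing details are elided:

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    asmlinkage void __do_softirq(void)
    {
            unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
            __u32 pending;

            /* ... */
    restart:
            /* ... run handlers for all pending softirqs once ... */

            pending = local_softirq_pending();
            if (pending) {
                    if (time_before(jiffies, end) && !need_resched())
                            goto restart;

                    /* Over budget: defer the remainder to ksoftirqd */
                    wakeup_softirqd();
            }
            /* ... */
    }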

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq() and
stop processing irqs after 10 restarts.

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
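
A sketch of the follow-up: the restart counter is re-added on top of the
deadline, so the loop terminates even when jiffies itself is frozen:

    #define MAX_SOFTIRQ_TIME     msecs_to_jiffies(2)
    #define MAX_SOFTIRQ_RESTART  10

    /* inside __do_softirq(), with max_restart = MAX_SOFTIRQ_RESTART */
    pending = local_softirq_pending();
    if (pending) {
            if (time_before(jiffies, end) && !need_resched() &&
                --max_restart)
                    goto restart;

            wakeup_softirqd();
    }
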
alesaiko pushed a commit that referenced this issue Mar 13, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. However, the feature has too much
of a negative impact on other workloads to apply it globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failing to do so would keep the nohz idle balancer from
being triggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
code which aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

The SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total number of SCHED_FAIR tasks a cpu has across its
various cfs_rqs, rather than on the number of tasks in a particular
cfs_rq (as implemented currently). As a result, with the current code,
it's possible for a running task (which is the sole task in its cfs_rq)
to be preempted much after its ideal_runtime has elapsed, resulting in
increased latency for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark schedule_io_timeout() with EXPORT_SYMBOL

Make schedule_io_timeout() visible to modules.

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in sched_slice() calculation by using wrong
cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather
than the correct one to which they belonged.

Fix the bug by using correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>
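
A sketch of scale_rt_power() with the fix: both inputs are read once via
ACCESS_ONCE() and the comparison result is reused, so total can no longer
be invalidated by a second read of rq->rt_avg:

    static unsigned long scale_rt_power(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            u64 total, available, age_stamp, avg;

            /*
             * Since we're reading these variables without
             * serialization, make sure we read them once and do
             * sanity checks on them.
             */
            age_stamp = ACCESS_ONCE(rq->age_stamp);
            avg = ACCESS_ONCE(rq->rt_avg);

            total = sched_avg_period() + (rq->clock - age_stamp);

            if (unlikely(total < avg)) {
                    /* Ensures that power won't end up being negative */
                    available = 0;
            } else {
                    available = total - avg;
            }

            /* ... scale available/total into a cpu power value ... */
    }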

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it to go idle. However, that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it
attaches to a new sched domain hierarchy.  This should let cpus run
load balance checks at the times we expect them to!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
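
The essential part of the fix in __disable_runtime() (kernel/sched/rt.c):
after forcing infinite runtime, the rt_rq is also explicitly unthrottled
so migrate_tasks() can see its tasks. A sketch:

    balanced:
            /*
             * Disable all the borrow logic by pretending we have inf
             * runtime - in which case borrowing doesn't make sense.
             */
            rt_rq->rt_runtime = RUNTIME_INF;
            rt_rq->rt_throttled = 0;
            raw_spin_unlock(&rt_rq->rt_runtime_lock);
            raw_spin_unlock(&rt_b->rt_runtime_lock);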

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is selected as the ILB, no reschedule IPI will be sent,
because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance will
be performed.

We must wait for the sched softirq to be raised on CPU 2 by some other
part of the kernel for NOHZ_BALANCE_KICK to finally be cleared.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f
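
A sketch of got_nohz_idle_kick() after the fix; when the kicked CPU turns
out to be busy, the flag is cleared instead of being left set forever:

    static inline bool got_nohz_idle_kick(void)
    {
            int cpu = smp_processor_id();

            if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                    return false;

            if (idle_cpu(cpu) && !need_resched())
                    return true;

            /*
             * We can't run the Idle Load Balance on this CPU this
             * time: cancel it and clear NOHZ_BALANCE_KICK so a
             * future kick can be delivered.
             */
            clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
            return false;
    }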

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. A missed try_to_wake_up_local() can stall workqueue
execution, but such stalls are likely to be finite, either by
another work item being queued or by the blocked one getting
unblocked.  There's no reason to trigger a BUG while holding the
rq lock and crash the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because there was an earlier instance that occurred long ago,
change to WARN_ON(). If there ever is a flood of these there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>
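
The fix is a one-line change in do_balance_runtime()
(kernel/sched/rt.c); a sketch:

    static int do_balance_runtime(struct rt_rq *rt_rq)
    {
            struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
            /* was: cpu_rq(smp_processor_id())->rd -- possibly another rd */
            struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
            int i, weight;
            u64 rt_period;

            weight = cpumask_weight(rd->span);
            /* ... borrow runtime only from rt_rqs inside this rd ... */
    }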

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit
the BUG_ON at line 689 of rt.c when doing cpu hotplug while there are
realtime threads running, because runtime is enabled twice while
rt_runtime may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>
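
A sketch of the fix in kernel/cpu.c: marking a CPU online marks it active
in the same step, preserving the invariant that cpu_active_mask is a
subset of cpu_online_mask, which set_cpus_allowed_ptr() relies on:

    void set_cpu_online(unsigned int cpu, bool online)
    {
            if (online) {
                    cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
                    cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
            } else {
                    cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
            }
    }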

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In undercommitted scenarios, especially in large guests, the yield_to
overhead is significantly high. When the run queue length of both the
source and the target is one, take the opportunity to bail out and
return -ESRCH. This return condition can be further exploited to quickly
come out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE
 handler upon seeing source runqueue length = 1, but that required
 exporting the rq length.) Peter came up with the elegant idea of
 returning -ESRCH from the scheduler core.

Signed-off-by: Peter Zijlstra <[email protected]>
Raghavendra: checking of the target vcpu's rq length added. (thanks Avi)
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92
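
A sketch of the added bail-out in yield_to() (kernel/sched/core.c), per
the patch; label names are illustrative:

    again:
            double_rq_lock(rq, p_rq);
            /* ... */

            /*
             * If there's only one task on each runqueue there is
             * nothing to yield to; report -ESRCH so callers such as
             * the KVM PLE handler can give up quickly.
             */
            if (rq->nr_running == 1 && p_rq->nr_running == 1) {
                    yielded = -ESRCH;
                    goto out_unlock;
            }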

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
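
A sketch of the added fast path at the top of select_idle_sibling()
(kernel/sched/fair.c):

    static int select_idle_sibling(struct task_struct *p, int target)
    {
            int i = task_cpu(p);

            if (idle_cpu(target))
                    return target;

            /*
             * If the previous cpu is cache affine and idle, don't
             * bother scanning: just go back there.
             */
            if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
                    return i;

            /* ... otherwise scan the sd_llc domain for an idle cpu ... */
    }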

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq is set and sets se->on_rq = 0,
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule(), the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
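
A sketch of switched_from_fair() with the fix; the condition now tests
p->on_rq, which dequeue_entity() does not clear for a merely sleeping
task:

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /*
             * Normalize vruntime only if it hasn't been normalized
             * already: use !p->on_rq rather than !se->on_rq, which
             * dequeue_entity() has already cleared for a sleeper.
             */
            if (!p->on_rq && p->state != TASK_RUNNING) {
                    place_entity(cfs_rq, se, 0);
                    se->vruntime -= cfs_rq->min_vruntime;
            }
    }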

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line
if total is a non-zero number but has its lower 32 bits zeroed. Removing
the cast is not a good solution, since some do_div() implementations do
cast to u32 internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
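
The fix replaces the 32 bit division helper with a full 64 bit one; a
sketch of the scaling code in thread_group_times():

    if (total) {
            u64 temp = (__force u64) rtime;

            temp *= (__force u64) cputime.utime;
            /*
             * was: do_div(temp, (__force u32) total) -- divides by 0
             * when total != 0 but its low 32 bits are all zero
             */
            temp = div64_u64(temp, (__force u64) total);
            utime = (__force cputime_t) temp;
    } else {
            utime = rtime;
    }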

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way, due to the clamping
implementation, for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1's sched_clock happens in the context of a realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism triggering and preventing scheduling of
RT tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local(), and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
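
A sketch of the 32bit readout path in sched_clock_remote() after the fix;
cmpxchg64() with identical old and new values yields an atomic 64bit
snapshot without modifying the target:

    static u64 sched_clock_remote(struct sched_clock_data *scd)
    {
            struct sched_clock_data *my_scd = this_scd();
            u64 this_clock, remote_clock;

    #if BITS_PER_LONG != 64
            /*
             * A plain load of the remote scd->clock can be torn
             * between the low and high words. cmpxchg64(ptr, 0, 0)
             * either swaps 0 for 0 (a no-op) or fails; either way it
             * returns the full 64bit value atomically.
             */
            this_clock = sched_clock_local(my_scd);
            remote_clock = cmpxchg64(&scd->clock, 0, 0);
    #else
            this_clock = sched_clock_local(my_scd);
            remote_clock = scd->clock;
    #endif
            /* ... compare and update local/remote as before ... */
    }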

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global)
rq's priority, while the rt_rq passed to them may not be the top-level
rt_rq. This is wrong, because changing the priority on a child level
does not guarantee that the priority is the highest all over the rq. So,
this leak makes RT balancing unusable.

A short example: the task having the highest priority among all of the
rq's RT tasks (no other task has the same priority) is waking on a
throttled rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
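
A sketch of the guarded cpupri update, per the patch; only the top-level
rt_rq may change the rq-wide cpupri:

    static void
    inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
    {
            struct rq *rq = rq_of_rt_rq(rt_rq);

    #ifdef CONFIG_RT_GROUP_SCHED
            /*
             * Change rq's cpupri only if rt_rq is the top queue.
             */
            if (&rq->rt != rt_rq)
                    return;
    #endif
            if (rq->online && prio < prev_prio)
                    cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
    }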

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
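
A sketch of the initialization fix in init_tg_cfs_entry()
(kernel/sched/fair.c): new group entities start at NICE_0_LOAD instead of
zero weight:

    void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
                           struct sched_entity *se, int cpu,
                           struct sched_entity *parent)
    {
            /* ... */
            if (se) {
                    /* guarantee group entities always have weight */
                    update_load_set(&se->load, NICE_0_LOAD);
                    se->parent = parent;
            }
    }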

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick time by remembering the jiffies value of last updating the
timeout. As long as the RT task's jiffies is different with the global
jiffies value, we allow its timeout to be updated.

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevens futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.
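
The reset itself is a one-liner in the deboost path of rt_mutex_setprio();
roughly, following the upstream fix:

    if (rt_prio(prio))
            p->sched_class = &rt_sched_class;
    else {
            if (rt_prio(oldprio))
                    p->rt.timeout = 0;      /* forget boosted RT runtime */
            p->sched_class = &fair_sched_class;
    }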

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.
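
In __sched_setscheduler() the requeue then becomes, as a sketch (the
ENQUEUE_HEAD flag mirrors the earlier rt_mutex_setprio() fix):

    if (on_rq)
            /* head of the prio list when the task was deboosted */
            enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);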

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then it checks if a task should be pulled into the current
runqueue in idle_balance(), assuming it will otherwise go idle.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go idle even
though we have filled in idle_stamp, thinking we will.

This patch closes the window by checking, after taking the lock again,
whether the runqueue has been modified without a task having been pulled,
so that we won't go idle right after in the __schedule() function.
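
The check added after re-taking the lock is essentially:

    raw_spin_lock(&this_rq->lock);

    /* Another cpu may have queued a task while the lock was dropped;
     * pretend we pulled one so __schedule() does not go idle. */
    if (this_rq->cfs.h_nr_running && !pulled_task)
            pulled_task = 1;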

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib, please expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
only such loop and it is using the possible mask, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
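
The fold is a small helper run once the dead cpu is empty; roughly, with
names as in the upstream patch:

    /* Fold any leftover nr_active delta of the dead cpu into the
     * global load-average bookkeeping instead of migrating it. */
    static void calc_load_migrate(struct rq *rq)
    {
            long delta = calc_load_fold_active(rq);

            if (delta)
                    atomic_long_add(delta, &calc_load_tasks);
    }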

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.
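
So the call moves into the CPU_DEAD leg of migration_call(); sketch:

    case CPU_DEAD:
            /* nr_running is 0 here: even the stopper thread is gone */
            calc_load_migrate(rq);
            break;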

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments; however, in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
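
The corrected lookup in get_rr_interval_fair() is, in sketch form:

    struct sched_entity *se = &task->se;
    unsigned int rr_interval = 0;

    if (rq->cfs.load.weight)
            /* was: sched_slice(&rq->cfs, se) */
            rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));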

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.
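
Sketch of the serialized accessor, following the upstream change:

    /* p->sched_task_group is written only under the scheduler locks,
     * so repeated reads within a critical section stay consistent. */
    static inline struct task_group *task_group(struct task_struct *p)
    {
            return p->sched_task_group;
    }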

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend/
resume (i.e., restore the cpusets to exactly what they were before suspend,
by not touching them at all) but also speed up suspend/resume, because we
avoid running cpuset update code for every CPU being offlined/onlined.
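
In code terms the hotplug notifier ends up doing roughly the following
(illustrative sketch; num_cpus_frozen is the counter the change uses to
detect the last online operation of resume):

    /* cpuset_cpu_active() sketch: *_FROZEN actions mark s/r hotplug */
    case CPU_ONLINE_FROZEN:
            num_cpus_frozen--;
            if (likely(num_cpus_frozen)) {
                    /* mid-resume: one big domain, cpusets untouched */
                    partition_sched_domains(1, NULL, NULL);
                    break;
            }
            /* last resume online: fall through, restore cpusets */
    case CPU_ONLINE:
            cpuset_update_active_cpus();
            break;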

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to a "use-after-free" and cause a
panic, because it is the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer was misbehaving I found that the
rq->cpu_load[] tables were completely screwy. A bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc.). This means that
especially the higher indices were significantly lower than they ought
to be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
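
The nohz-exit aging path looks roughly like this, following the upstream
helper:

    /* Account the idle period as zero-load ticks on nohz exit,
     * instead of leaving the cpu_load[] samples stale. */
    void update_idle_cpu_load(struct rq *this_rq)
    {
            unsigned long curr_jiffies = jiffies;
            unsigned long pending_updates;

            /* bail if there's load or we're actually up-to-date */
            if (this_rq->load.weight ||
                curr_jiffies == this_rq->last_load_update_tick)
                    return;

            pending_updates = curr_jiffies - this_rq->last_load_update_tick;
            this_rq->last_load_update_tick = curr_jiffies;
            __update_cpu_load(this_rq, 0, pending_updates);
    }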

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online;
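
The change itself is mechanical; sketch of init_rootdomain() after the
patch:

    /* zalloc_cpumask_var() hands back a zeroed mask, so no stale
     * bits can leak into span, online or rto_mask. */
    if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
            goto out;
    if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
            goto free_span;
    if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
            goto free_online;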

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback to ksoftirqd condition to:
- A time limit of 2 ms,
- need_resched() being set on current task;

When one of these conditions is met, we wake up ksoftirqd for further softirq
processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.
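
The resulting loop control in __do_softirq() is, in sketch form:

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    unsigned long end = jiffies + MAX_SOFTIRQ_TIME;   /* 2 ms budget */
restart:
    /* ... handle all pending softirq vectors ... */
    pending = local_softirq_pending();
    if (pending) {
            if (time_before(jiffies, end) && !need_resched())
                    goto restart;
            wakeup_softirqd();          /* defer the rest to ksoftirqd */
    }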

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following bench, 200 antagonist "netperf -t TCP_RR" instances are started in
background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq() and
stop processing irqs after 10 restarts.
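
With the counter restored, the bail-out combines the 2 ms deadline from
the previous patch with a restart bound; sketch:

    int max_restart = MAX_SOFTIRQ_RESTART;   /* 10 */

    /* bound the restarts even when jiffies is not advancing */
    if (pending) {
            if (time_before(jiffies, end) && !need_resched() &&
                --max_restart)
                    goto restart;
            wakeup_softirqd();
    }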

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
alesaiko pushed a commit that referenced this issue Mar 14, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads however to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.
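
In the wakeup path this amounts to a check along these lines
(hypothetical sketch: the placement inside select_task_rq_fair() and the
use of select_idle_sibling()/prev_cpu here are illustrative):

    /* honour the flag on either the wakee or the waker */
    if ((p->flags & PF_WAKE_UP_IDLE) ||
        (current->flags & PF_WAKE_UP_IDLE)) {
            int i = select_idle_sibling(p, prev_cpu);

            if (idle_cpu(i))
                    return i;
    }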

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would not trigger nohz idle balancer on
that cpu for unnecessarily long time (introducing additional scheduling
latencies for tasks). Fix the bug in the scheduler which aims to reset
next_balance before a cpu goes idle (as per the existing comment) but is
clearly not doing so.

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

The SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming of that hrtimer should
be based on the total number of SCHED_FAIR tasks a cpu has across its
various cfs_rqs, rather than on the number of tasks in a particular
cfs_rq (as implemented currently). As a result, with the current code,
it's possible for a running task (which is the sole task in its cfs_rq)
to be preempted much after its ideal_runtime has elapsed, resulting
in increased latency for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.
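
The arming condition in hrtick_start_fair() then keys off the rq-wide
count; sketch, where delta is the precomputed remainder of the slice:

    /* was: if (cfs_rq->nr_running > 1) -- only p's own cfs_rq */
    if (rq->cfs.h_nr_running > 1)
            hrtick_start(rq, delta);    /* fire when the slice is up */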

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark io_schedule_timeout() with EXPORT_SYMBOL

Make io_schedule_timeout() visible to modules.

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in the sched_slice() calculation by using the
wrong cfs_rq for tasks. rq->cfs was incorrectly used as the task's cfs_rq,
rather than the correct one to which the task belonged.

Fix the bug by using the correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
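
Sketch of the hardened read in scale_rt_power(), following the upstream
fix:

    u64 total, available, age_stamp, avg;

    /* read rt_avg and age_stamp once, so the check and the
     * subtraction are guaranteed to see the same values */
    age_stamp = ACCESS_ONCE(rq->age_stamp);
    avg = ACCESS_ONCE(rq->rt_avg);

    total = sched_avg_period() + (rq->clock - age_stamp);

    if (unlikely(total < avg)) {
            /* ensure 'available' cannot go negative */
            available = 0;
    } else {
            available = total - avg;
    }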

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it to go idle. However that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it is
attached to a new sched domain hierarchy. This should let cpus run
load balance checks at the time we expect them to!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
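
Sketch of the relevant hunk in __disable_runtime():

    /*
     * Disable all the borrow logic by pretending we have inf
     * runtime - in which case borrowing doesn't make sense.
     */
    rt_rq->rt_runtime = RUNTIME_INF;
    rt_rq->rt_throttled = 0;        /* the fix: also unthrottle */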

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is next selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2, thanks to another
part of the kernel, to come back and clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
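
Sketch of got_nohz_idle_kick() with the fix applied:

    static inline bool got_nohz_idle_kick(void)
    {
            int cpu = smp_processor_id();

            if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                    return false;

            if (idle_cpu(cpu) && !need_resched())
                    return true;

            /* can't run the ILB here: cancel it and drop the kick */
            clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
            return false;
    }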

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. A missing try_to_wake_up_local() can stall workqueue
execution, but such stalls are likely to be finite, either by
another work item being queued or the blocked one getting
unblocked.  There's no reason to trigger a BUG while holding the rq
lock and crash the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because an earlier instance occurred long ago,
change to WARN_ON(). If there is ever a flood of these there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.
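
The change in do_balance_runtime() is a one-liner; sketch:

    /* was: struct root_domain *rd = cpu_rq(smp_processor_id())->rd; */
    struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;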

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try to reclaim runtime lent to another rt_rq but the runtime
has been lent to an rt_rq in another rd.

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit the
BUG_ON at line 689 of rt.c when doing cpu hotplug while realtime
threads are running, because runtime is enabled twice while rt_runtime
may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for the need
to notify the cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
both source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm PLE
 handler upon seeing source runqueue length = 1, but that had to export the
 rq length.) Peter came up with the elegant idea of returning -ESRCH from
 the scheduler core.
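
Sketch of the bail-out added to yield_to():

    /* With one task on each side there is nothing to yield to and
     * nobody else to run; don't bother. */
    if (rq->nr_running == 1 && p_rq->nr_running == 1) {
            yielded = -ESRCH;
            goto out_irq;
    }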

Signed-off-by: Peter Zijlstra <[email protected]>
Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs
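
The added fast path in select_idle_sibling(), in sketch form (i is the
task's previous cpu):

    /* If the previous cpu shares cache with the target and is idle,
     * take it instead of walking the whole sd_llc domain. */
    if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
            return i;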

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0,
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
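
Sketch of the corrected check in switched_from_fair():

    /* p->on_rq, not se->on_rq: se->on_rq is already 0 for a task
     * that merely set TASK_INTERRUPTIBLE without scheduling */
    if (!p->on_rq && p->state != TASK_RUNNING) {
            place_entity(cfs_rq, se, 0);
            se->vruntime -= cfs_rq->min_vruntime;
    }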

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line
if total is a non-zero number but has its lower 32 bits zeroed. Removing
the cast is not a good solution since some do_div() implementations do
cast to u32 internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
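
Sketch of the overflow-safe scaling, following the upstream helper:

    static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                                 cputime_t total)
    {
            u64 temp = (__force u64) rtime;

            temp *= (__force u64) utime;

            /* pick a divide that honours all bits of 'total' */
            if (sizeof(cputime_t) == 4)
                    temp = div_u64(temp, (__force u32) total);
            else
                    temp = div64_u64(temp, (__force u64) total);

            return (__force cputime_t) temp;
    }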

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is quite a rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU0's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_remote_clock(). The time jump gets propagated to CPU0 via
sched_remote_clock() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

   #1 can be excluded because sched_clock would continue to increase
   for one jiffy and then go stale.

   #2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
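
On 32bit the remote readout therefore goes through cmpxchg64(); sketch:

    /* The only atomic 64bit read on 32bit is a cmpxchg: exchanging
     * 0 with 0 is a no-op that returns the current value. */
    remote_clock = cmpxchg64(&scd->clock, 0, 0);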

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

A short example: the task having the highest priority among all of the
rq's RT tasks (no other task has the same priority) is woken on a
throttled rt_rq. The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.
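
The guard added to inc_rt_prio_smp()/dec_rt_prio_smp() is, in sketch:

    /* Only the top-level rt_rq may drive the rq-wide cpupri. */
    if (&rq->rt != rt_rq)
            return;

    cpupri_set(&rq->rd->cpupri, rq->cpu, prio);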

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.
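
The initialization is a single line in the group-entity setup path;
sketch:

    /* guarantee group entities always have weight */
    update_load_set(&se->load, NICE_0_LOAD);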

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
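
Sketch of the override in do_sched_rt_period_timer():

    const struct cpumask *span = sched_rt_period_mask();

    /* The root group's timer must walk every online cpu, even when
     * this cpu's rd->span has been narrowed by isolation. */
    if (rt_b == &root_task_group.rt_bandwidth)
            span = cpu_online_mask;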

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick time by remembering the jiffies value of last updating the
timeout. As long as the RT task's jiffies is different with the global
jiffies value, we allow its timeout to be updated.

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevens futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson<[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.
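
The reset happens where the boost is undone, in rt_mutex_setprio(); a
trimmed sketch:

    void rt_mutex_setprio(struct task_struct *p, int prio)
    {
        ...
        if (rt_prio(prio))
            p->sched_class = &rt_sched_class;
        else {
            /* leaving the RT class: forget accumulated RT runtime */
            if (rt_prio(oldprio))
                p->rt.timeout = 0;
            p->sched_class = &fair_sched_class;
        }
        ...
    }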

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
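
The requeue in __sched_setscheduler() then becomes, roughly (a trimmed
sketch, not the verbatim diff):

    if (running)
        p->sched_class->set_curr_task(rq);
    if (on_rq) {
        /*
         * Enqueue to the head of the prio bucket when the (user space)
         * priority of the task dropped, so it stays on the CPU.
         */
        enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);
    }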

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock, now that try_to_wake_up() takes another lock we
need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually one-liner, it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
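
A trimmed sketch of the helper and its use in the scheduler:

    /*
     * Default definition; architectures may provide a stronger or
     * cheaper variant.
     */
    #ifndef smp_mb__before_spinlock
    #define smp_mb__before_spinlock()   smp_wmb()
    #endif

    static void __sched __schedule(void)
    {
        ...
        /*
         * Make sure that the signal_pending_state() check below can't
         * be reordered with the __set_current_state(TASK_INTERRUPTIBLE)
         * done by the caller, to avoid the race with signal_wake_up().
         */
        smp_mb__before_spinlock();
        raw_spin_lock_irq(&rq->lock);
        ...
    }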

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then it checks if a task should be pulled in the current
runqueue in idle_balance() assuming it will go to idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go to idle but
we have filled the idle_stamp, thinking we will.

This patch closes the window by checking if the runqueue has been modified
but without pulling a task after taking the lock again, so we won't go to idle
right after in the __schedule() function.
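
A trimmed sketch of the recheck at the end of idle_balance():

    raw_spin_lock(&this_rq->lock);

    /*
     * While browsing the domains we released the rq lock, and a task
     * could have been enqueued in the meantime. Since we are not going
     * idle, pretend we pulled a task.
     */
    if (this_rq->cfs.h_nr_running && !pulled_task)
        pulled_task = 1;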

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong,
in that he sees artifacts due to this (Rakib, please expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops. However, since nr_uninterruptible() is
the only such loop and it is already using possible, let's not bother
at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
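
The folding ends up as a small helper run once the cpu is dead; a
trimmed sketch:

    static void calc_load_migrate(struct rq *rq)
    {
        long delta = calc_load_fold_active(rq);

        /* fold the dead cpu's nr_active into the global count */
        if (delta)
            atomic_long_add(delta, &calc_load_tasks);
    }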

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization:
we know the state of the rq is stable since its cpu is dead, and we
modify the global state using appropriate atomic ops.
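
In migration_call() the call then sits in the CPU_DEAD leg; roughly:

    case CPU_DEAD:
        /* nr_running == 0 here: even the stopper thread is gone */
        calc_load_migrate(cpu_rq(cpu));
        break;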

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.
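
The rewrite tracks idle-deltas in two buckets, indexed so that a sample
taken inside the current window cannot fold back into it; a trimmed
sketch:

    static atomic_long_t calc_load_idle[2];
    static int calc_load_idx;

    static inline int calc_load_write_idx(void)
    {
        int idx = calc_load_idx;

        /*
         * See calc_global_nohz(): if we observe the new index, we
         * also need to observe the new update time.
         */
        smp_rmb();

        /*
         * If the folding window started, make sure we start writing
         * in the next idle-delta.
         */
        if (!time_before(jiffies, calc_load_update))
            idx++;

        return idx & 1;
    }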

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments; however, in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
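
After the fix, the helper reads roughly as follows (trimmed sketch):

    static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
    {
        struct sched_entity *se = &task->se;
        unsigned int rr_interval = 0;

        /*
         * Time slice is 0 for SCHED_OTHER tasks that are on an idle
         * runqueue.
         */
        if (rq->cfs.load.weight)
            rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));

        return rr_interval;
    }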

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.
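
A trimmed sketch of the extra pointer:

    struct task_struct {
        ...
    #ifdef CONFIG_CGROUP_SCHED
        /* changed only under pi_lock or rq->lock, see sched_move_task() */
        struct task_group *sched_task_group;
    #endif
        ...
    };

    /* rely on the cached pointer instead of walking the css data */
    static inline struct task_group *task_group(struct task_struct *p)
    {
        return p->sched_task_group;
    }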

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to
suspend/resume (i.e., restore the cpusets to exactly what they were
before suspend, by not touching them at all) but also speed up
suspend/resume, because we avoid running the cpuset update code for
every CPU being offlined/onlined.
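
A trimmed sketch of the resume-side handling in the scheduler's cpuset
hotplug notifier:

    static int cpuset_cpu_active(struct notifier_block *nfb,
                                 unsigned long action, void *hcpu)
    {
        switch (action) {
        case CPU_ONLINE_FROZEN:
        case CPU_DOWN_FAILED_FROZEN:
            /*
             * num_cpus_frozen tracks how many CPUs are involved in
             * the suspend/resume sequence. While it is non-zero, just
             * build a single sched domain and ignore cpusets.
             */
            num_cpus_frozen--;
            if (likely(num_cpus_frozen)) {
                partition_sched_domains(1, NULL, NULL);
                break;
            }
            /*
             * This is the last CPU online operation, so fall through
             * and rebuild the domains from the unaltered cpusets.
             */
        case CPU_ONLINE:
        case CPU_DOWN_FAILED:
            cpuset_update_active_cpus();
            break;
        default:
            return NOTIFY_DONE;
        }
        return NOTIFY_OK;
    }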

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a
panic, because it is the new cgroup's refcount that is incremented at
(*1), so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
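
The fix makes task_fork_fair() refresh the copied pointers
unconditionally; a trimmed sketch:

    static void task_fork_fair(struct task_struct *p)
    {
        ...
        /*
         * Not only the cpu but also the task_group of the parent might
         * have changed after parent->se.parent,cfs_rq were copied to
         * child->se.parent,cfs_rq. Call __set_task_cpu() to make the
         * child's pointers valid again.
         */
        rcu_read_lock();
        __set_task_cpu(p, this_cpu);
        rcu_read_unlock();
        ...
    }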

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer was acting funny, I found
that the rq->cpu_load[] tables were completely screwy. A bit more
digging revealed that the updates that got through were missing ticks,
followed by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle, etc.), which means that
especially the higher indices were significantly lower than they ought
to be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
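
A trimmed sketch of the catch-up path used from nohz_idle_balance():

    /*
     * Called from nohz_idle_balance() to update the load ratings before
     * doing the idle balance.
     */
    static void update_idle_cpu_load(struct rq *this_rq)
    {
        unsigned long curr_jiffies = jiffies;
        unsigned long load = this_rq->load.weight;
        unsigned long pending_updates;

        /* bail if there's load or we're actually up-to-date */
        if (load || curr_jiffies == this_rq->last_load_update_tick)
            return;

        pending_updates = curr_jiffies - this_rq->last_load_update_tick;
        this_rq->last_load_update_tick = curr_jiffies;

        __update_cpu_load(this_rq, load, pending_updates);
    }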

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
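
The nohz-exit callback looks roughly like this (trimmed sketch):

    void update_cpu_load_nohz(void)
    {
        struct rq *this_rq = this_rq();
        unsigned long curr_jiffies = jiffies;
        unsigned long pending_updates;

        if (curr_jiffies == this_rq->last_load_update_tick)
            return;

        raw_spin_lock(&this_rq->lock);
        pending_updates = curr_jiffies - this_rq->last_load_update_tick;
        if (pending_updates) {
            this_rq->last_load_update_tick = curr_jiffies;
            /*
             * We were idle: age the pending periods with load 0. The
             * current (possibly non-zero) load is folded in by the
             * following tick.
             */
            __update_cpu_load(this_rq, 0, pending_updates);
        }
        raw_spin_unlock(&this_rq->lock);
    }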

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated-domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into
another root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online;
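
After the change, all three masks come from zalloc_cpumask_var(); a
trimmed sketch of init_rootdomain():

    static int init_rootdomain(struct root_domain *rd)
    {
        memset(rd, 0, sizeof(*rd));

        if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
            goto out;
        if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
            goto free_span;
        if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
            goto free_online;
        ...
    }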

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:
- a time limit of 2 ms, or
- need_resched() being set on the current task.

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.
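
The resulting loop termination looks roughly like this (trimmed
sketch):

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    asmlinkage void __do_softirq(void)
    {
        unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
        ...
    restart:
        /* run the pending softirq actions */
        ...
        pending = local_softirq_pending();
        if (pending) {
            if (time_before(jiffies, end) && !need_resched())
                goto restart;

            /* defer the remainder to ksoftirqd */
            wakeup_softirqd();
        }
        ...
    }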

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In following bench, 200 antagonist "netperf -t TCP_RR" are started in
background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq and
stop processing irqs after 10 restarts.
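
With the counter re-added, the bail-out condition becomes, roughly
(trimmed sketch):

    #define MAX_SOFTIRQ_RESTART 10

    asmlinkage void __do_softirq(void)
    {
        unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
        int max_restart = MAX_SOFTIRQ_RESTART;
        ...
        pending = local_softirq_pending();
        if (pending) {
            /* the restart cap still works when jiffies is frozen */
            if (time_before(jiffies, end) && !need_resched() &&
                --max_restart)
                goto restart;

            wakeup_softirqd();
        }
        ...
    }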

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
alesaiko pushed a commit that referenced this issue Mar 26, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads however to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.
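
A minimal sketch of how such a flag test can look in the wakeup path
(illustrative only; the helper name and exact shape are assumptions,
not necessarily the patch's code):

    static inline bool wake_to_idle(struct task_struct *p)
    {
        /* honour the flag on either the waker or the wakee */
        return (current->flags & PF_WAKE_UP_IDLE) ||
               (p->flags & PF_WAKE_UP_IDLE);
    }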

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.
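
With the sysctl added, the same test grows one more clause; an
illustrative sketch under the same naming assumptions:

    unsigned int sysctl_sched_wake_to_idle;

    static inline bool wake_to_idle(struct task_struct *p)
    {
        return (current->flags & PF_WAKE_UP_IDLE) ||
               (p->flags & PF_WAKE_UP_IDLE) ||
               sysctl_sched_wake_to_idle;
    }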

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so means the nohz idle balancer would not
be triggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
which aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.
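
A sketch of the idea at the tail of idle_balance() (illustrative, not
the verbatim diff):

    if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
        /*
         * We are going idle; next_balance may have been computed with
         * a busy_factor applied while we were busy, so reset it.
         */
        this_rq->next_balance = next_balance;
    }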

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence the arming of that hrtimer
should be based on the total number of SCHED_FAIR tasks a cpu has
across its various cfs_rqs, rather than on the number of tasks in a
particular cfs_rq (as implemented currently). As a result, with the
current code, it's possible for a running task (which is the sole task
in its cfs_rq) to be preempted much after its ideal_runtime has
elapsed, resulting in increased latency for tasks in other cfs_rqs on
the same cpu.

Fix this by alarming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.
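
A trimmed sketch of hrtick_start_fair() with the hierarchy-wide count
(shown with the cfs_rq argument as corrected by the follow-up fix a few
entries below):

    static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        /* arm the tick based on all fair tasks on this cpu */
        if (rq->cfs.h_nr_running > 1) {
            u64 slice = sched_slice(cfs_rq, se);
            u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
            s64 delta = slice - ran;

            if (delta < 0) {
                if (rq->curr == p)
                    resched_task(p);
                return;
            }
            hrtick_start(rq, delta);
        }
    }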

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark schedule_io_timeout() with EXPORT_SYMBOL

Make schedule_io_timeout() visible to modules.

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in the sched_slice() calculation by using the
wrong cfs_rq for tasks: rq->cfs was incorrectly used as the task's
cfs_rq, rather than the correct one to which the task belonged.

Fix the bug by using the correct cfs_rq for tasks.
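
The correction amounts to one line; a sketch in diff form:

    -	u64 slice = sched_slice(&rq->cfs, se);
    +	u64 slice = sched_slice(cfs_rq_of(se), se);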

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
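
A trimmed sketch of the fix in scale_rt_power():

    static unsigned long scale_rt_power(int cpu)
    {
        struct rq *rq = cpu_rq(cpu);
        u64 total, available, age_stamp, avg;

        /*
         * Since we read these variables without serialization, make
         * sure we read them once before doing sanity checks on them.
         */
        age_stamp = ACCESS_ONCE(rq->age_stamp);
        avg = ACCESS_ONCE(rq->rt_avg);

        total = sched_avg_period() + (rq->clock - age_stamp);

        if (unlikely(total < avg)) {
            /* Ensures that power won't end up being negative */
            available = 0;
        } else {
            available = total - avg;
        }
        ...
    }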

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it to go idle. However, that
patch relied on next_balance being calculated as a result of traversing
the cpu's sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it
is attached to a new sched domain hierarchy. This should let cpus run
load balance checks at the times we expect them to!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
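
A trimmed sketch of the unthrottling at the end of __disable_runtime():

    for_each_rt_rq(rt_rq, iter, rq) {
        ...
        /*
         * Disable all the borrow logic by pretending we have infinite
         * runtime, in which case borrowing doesn't make sense.
         */
        rt_rq->rt_runtime = RUNTIME_INF;
        rt_rq->rt_throttled = 0;
        raw_spin_unlock(&rt_rq->rt_runtime_lock);
        raw_spin_unlock(&rt_b->rt_runtime_lock);

        /* make rt_rq available for pick_next_task() */
        sched_rt_rq_enqueue(rt_rq);
    }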

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A to CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is selected as the ILB again, no reschedule IPI will be
sent, because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance
will be performed.

We must wait for the sched softirq to be raised on CPU 2, thanks to
another part of the kernel, to come back and clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
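
A trimmed sketch of got_nohz_idle_kick() after the change:

    static inline bool got_nohz_idle_kick(void)
    {
        int cpu = smp_processor_id();

        if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
            return false;

        if (idle_cpu(cpu) && !need_resched())
            return true;

        /*
         * We can't run the Idle Load Balance on this CPU this time,
         * so cancel it and clear NOHZ_BALANCE_KICK.
         */
        clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
        return false;
    }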

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked.  There's no reason to trigger BUG while holding rq
lock crashing the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
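
After the conversion the checks read, roughly:

    static void try_to_wake_up_local(struct task_struct *p)
    {
        struct rq *rq = task_rq(p);

        if (WARN_ON_ONCE(rq != this_rq()) ||
            WARN_ON_ONCE(p == current))
            return;

        lockdep_assert_held(&rq->lock);
        ...
    }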

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
un-noticed because there was an earlier instance that occurred long ago,
change to WARN_ON(). If there is ever a flood of these, there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime,
hit when we try to reclaim runtime lent to other rt_rq but the runtime
has been lent to a rt_rq in another rd.
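
A trimmed sketch of the change in do_balance_runtime():

    static int do_balance_runtime(struct rt_rq *rt_rq)
    {
        struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
        /* use the rd this rt_rq belongs to, not the current cpu's */
        struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
        int i, weight;

        weight = cpumask_weight(rd->span);

        raw_spin_lock(&rt_b->rt_runtime_lock);
        for_each_cpu(i, rd->span) {
            ...
        }
        ...
    }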

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit
the BUG_ON at line 689 of rt.c when doing cpu hotplug while realtime
threads are running, because runtime gets enabled twice while the
rt_runtime may have already changed.
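
Runtime enable/disable stays covered by the rq_online/rq_offline
callbacks that migration_call() already drives; a trimmed sketch:

    static void rq_online_rt(struct rq *rq)
    {
        if (rq->rt.overloaded)
            rt_set_overload(rq);

        __enable_runtime(rq);

        cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
    }

    static void rq_offline_rt(struct rq *rq)
    {
        if (rq->rt.overloaded)
            rt_clear_overload(rq);

        __disable_runtime(rq);

        cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
    }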

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.
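
A trimmed sketch of the fix in set_cpu_online():

    void set_cpu_online(unsigned int cpu, bool online)
    {
        if (online) {
            cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
            /* mark the cpu active together with online */
            cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
        } else {
            cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
        }
    }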

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()
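
A sketch of the idea (illustrative shape, not the verbatim diff):
re-read the task's cpu only once the task is observed off-cpu, at which
point a sleeping task can no longer be migrated under us:

    /* in try_to_wake_up(), with p->pi_lock held: */
    while (p->on_cpu)
        cpu_relax();
    /* pairs with the smp_wmb() in finish_lock_switch() */
    smp_rmb();

    src_cpu = task_cpu(p);    /* now stable */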

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.
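
A sketch of the idea (illustrative shape): run the test before pi_lock
is dropped, while the lock still pins the task:

    /* in try_to_wake_up(), still holding p->pi_lock: */
    notify = task_notify_on_migrate(p);    /* p cannot exit here */
    ...
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

    if (notify)
        /* fire the governor notification outside the lock */
        ...;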

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on breaking out of the kvm ple
handler upon seeing a source runqueue length of 1, but that had to
export the rq length.) Peter came up with the elegant idea of
returning -ESRCH in the scheduler core.
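
A trimmed sketch of the bail-out in yield_to():

    int __sched yield_to(struct task_struct *p, bool preempt)
    {
        ...
        /*
         * If we're the only runnable task on the rq and the target rq
         * also has only one task, there is absolutely no point in
         * yielding.
         */
        if (rq->nr_running == 1 && p_rq->nr_running == 1) {
            yielded = -ESRCH;
            goto out_irq;
        }
        ...
    }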

Signed-off-by: Peter Zijlstra <[email protected]>
[Raghavendra: added the check on the target vcpu's rq length (thanks Avi)]
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs
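
A trimmed sketch of the fix in select_idle_sibling():

    static int select_idle_sibling(struct task_struct *p, int target)
    {
        int i = task_cpu(p);

        if (idle_cpu(target))
            return target;

        /*
         * If the previous cpu is cache affine and idle, don't be
         * stupid.
         */
        if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
            return i;

        /* otherwise scan sd_llc for an idle cpu */
        ...
    }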

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0,
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule(), the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
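
A trimmed sketch of switched_from_fair() after the fix:

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        /*
         * Check p->on_rq, not se->on_rq: only a task that is genuinely
         * sleeping still has non-normalized vruntime at this point.
         */
        if (!p->on_rq && p->state != TASK_RUNNING) {
            /*
             * Fix up our vruntime so that the current sleep doesn't
             * cause 'unlimited' sleep bonus.
             */
            place_entity(cfs_rq, se, 0);
            se->vruntime -= cfs_rq->min_vruntime;
        }
        ...
    }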

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line
if total is a non-zero number but has its lower 32 bits zeroed.
Removing the casting is not a good solution, since some do_div()
implementations do cast to u32 internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
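
A trimmed sketch of the fixed scaling helper:

    static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                                 cputime_t total)
    {
        u64 temp = (__force u64) rtime;

        temp *= (__force u64) utime;

        /* pick the divide that matches the width of cputime_t */
        if (sizeof(cputime_t) == 4)
            temp = div_u64(temp, (__force u32) total);
        else
            temp = div64_u64(temp, (__force u64) total);

        return (__force cputime_t) temp;
    }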

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout, and the update must
cross the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way, due to the clamping
implementation, for about 4 seconds, until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event(), which invokes sched_clock_local(), and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value. This can be excluded: the bogus readout would be
   clamped within one jiffy and then go stale.

2) sched_clock() which reads the TSC returns a sporadic wrong value.
   This can be excluded as well: it would not propagate the clock
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
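
A trimmed sketch of the readout in sched_clock_remote() after the fix:

    static u64 sched_clock_remote(struct sched_clock_data *scd)
    {
        struct sched_clock_data *my_scd = this_scd();
        u64 this_clock, remote_clock;
        ...
    #if BITS_PER_LONG != 64
    again:
        /*
         * Re-read via sched_clock_local() on each retry: an NMI could
         * use sched_clock_local() and modify my_scd->clock between the
         * low and the high 32bit readout.
         */
        this_clock = sched_clock_local(my_scd);
        /*
         * Force an atomic 64bit readout; a plain load could be torn by
         * a concurrent update on the remote cpu.
         */
        remote_clock = cmpxchg64(&scd->clock, 0, 0);
    #else
        /* On 64bit the reads are atomic versus the updates. */
        sched_clock_local(my_scd);
    again:
        this_clock = my_scd->clock;
        remote_clock = scd->clock;
    #endif
        ...
    }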

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing the priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all of
the rq's RT tasks (no other task has the same priority) is woken on a
throttled rt_rq. The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.
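
A trimmed sketch of the guard added to inc_rt_prio_smp()
(dec_rt_prio_smp() gets the same treatment):

    static void
    inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
    {
        struct rq *rq = rq_of_rt_rq(rt_rq);

    #ifdef CONFIG_RT_GROUP_SCHED
        /* change the rq's cpupri only for the top-level rt_rq */
        if (&rq->rt != rt_rq)
            return;
    #endif
        if (rq->online && prio < prev_prio)
            cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
    }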

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted
in early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.
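
A trimmed sketch of the initialization fix in init_tg_cfs_entry():

    void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
                           struct sched_entity *se, int cpu,
                           struct sched_entity *parent)
    {
        ...
        if (se) {
            /* guarantee group entities always have weight */
            update_load_set(&se->load, NICE_0_LOAD);
            se->parent = parent;
        }
        ...
    }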

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
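
A trimmed sketch of the change in do_sched_rt_period_timer():

    static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
    {
        const struct cpumask *span;
        ...
        span = sched_rt_period_mask();
    #ifdef CONFIG_RT_GROUP_SCHED
        /*
         * Replenish the root task group over all online CPUs, no
         * matter where the period timer happens to run, so isolated
         * CPUs never stay throttled indefinitely.
         */
        if (rt_b == &root_task_group.rt_bandwidth)
            span = cpu_online_mask;
    #endif
        ...
    }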

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick time by remembering the jiffies value of last updating the
timeout. As long as the RT task's jiffies is different with the global
jiffies value, we allow its timeout to be updated.

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevens futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson<[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust contest]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
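
In __sched_setscheduler() this boils down to picking the enqueue flag from
the priority movement (a numerically larger prio value means a lower
priority); a sketch:

    if (running)
            p->sched_class->set_curr_task(rq);
    if (on_rq)
            /* requeue at the head of the new prio list when deboosted */
            enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);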

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock, now that try_to_wake_up() takes another lock we
need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
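
The helper and its call site read roughly as follows (heavily condensed; the
default lives in include/linux/spinlock.h):

    /*
     * Need only guarantee that a STORE before the critical section
     * cannot be reordered with a LOAD inside it; spin_lock() already
     * keeps the LOAD from escaping, so smp_wmb() suffices by default.
     */
    #ifndef smp_mb__before_spinlock
    #define smp_mb__before_spinlock()       smp_wmb()
    #endif

    static void __sched __schedule(void)
    {
            struct rq *rq = cpu_rq(smp_processor_id());

            /*
             * Make sure the signal_pending_state() check inside the locked
             * section cannot be reordered with the __set_current_state()
             * done by the caller, to close the signal_wake_up() race.
             */
            smp_mb__before_spinlock();
            raw_spin_lock_irq(&rq->lock);
            /* ... pick the next task as usual ... */
    }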

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then it checks if a task should be pulled in the current
runqueue in idle_balance() assuming it will go to idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go to idle but
we have filled the idle_stamp, thinking we will.

This patch closes the window by checking if the runqueue has been modified
but without pulling a task after taking the lock again, so we won't go to idle
right after in the __schedule() function.

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
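
The window is closed by a recheck right after the lock is retaken at the end
of idle_balance(); roughly:

    raw_spin_lock(&this_rq->lock);

    /*
     * While browsing the domains we released this_rq->lock and another
     * cpu may have enqueued a task in the meantime.  Report it as a
     * pull so __schedule() does not treat this cpu as going idle.
     */
    if (this_rq->nr_running && !pulled_task)
            pulled_task = 1;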

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib, please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
only such loop and it is already using possible, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
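
Folding the delta instead of migrating it is only a few lines in the hotplug
path (a sketch):

    static void calc_load_migrate(struct rq *rq)
    {
            long delta = calc_load_fold_active(rq);

            /* fold the dead cpu's active count into the global sum */
            if (delta)
                    atomic_long_add(delta, &calc_load_tasks);
    }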

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments; however, in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
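
The repair is a one-argument change in the fair class's rr_interval hook; a
sketch:

    static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
    {
            struct sched_entity *se = &task->se;
            unsigned int rr_interval = 0;

            /* 0 timeslice for tasks on an otherwise idle runqueue */
            if (rq->cfs.load.weight)
                    /* was: sched_slice(&rq->cfs, se), i.e. the wrong cfs_rq */
                    rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));

            return rr_interval;
    }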

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to
suspend/resume (i.e., it restores the cpusets to exactly what they were
before suspend, by not touching them at all) but also speeds up
suspend/resume, because we avoid running cpuset update code for every
CPU being offlined/onlined.

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it is the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer did funny things, I found that
the rq->cpu_load[] tables were completely screwy. A bit more digging
revealed that the updates that got through were missing ticks, followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any cpu other than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online.

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
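
Condensed, the allocation side ends up looking like this (cpupri setup and
the rest of the function trimmed):

    static int init_rootdomain(struct root_domain *rd)
    {
            /*
             * zalloc_cpumask_var() hands back a zeroed mask, so no stale
             * garbage bits can leak into span, online or rto_mask.
             */
            if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
                    goto out;
            if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
                    goto free_span;
            if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
                    goto free_online;
            return 0;

    free_online:
            free_cpumask_var(rd->online);
    free_span:
            free_cpumask_var(rd->span);
    out:
            return -ENOMEM;
    }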

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:
- a time limit of 2 ms,
- need_resched() being set on the current task.

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following benchmark, 200 antagonist "netperf -t TCP_RR" instances
are started in the background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
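
Condensed, the new dispatcher loop condition in kernel/softirq.c looks like
this (a sketch, not the full function):

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    asmlinkage void __do_softirq(void)
    {
            unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
            __u32 pending;

    restart:
            /* ... run the pending softirq actions with irqs enabled ... */

            pending = local_softirq_pending();
            if (pending) {
                    /* keep iterating only while cheap and uncontended */
                    if (time_before(jiffies, end) && !need_resched())
                            goto restart;

                    /* otherwise hand the remainder to ksoftirqd */
                    wakeup_softirqd();
            }
    }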

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq() and
stop processing softirqs after 10 restarts.

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
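
Relative to the loop shown after the previous softirq patch, the guard
simply grows a restart cap, so it no longer depends on jiffies making
progress (sketch; end and max_restart are set up at function entry):

    #define MAX_SOFTIRQ_RESTART 10

    pending = local_softirq_pending();
    if (pending) {
            /*
             * The jiffies-based limit is useless when jiffies itself
             * stops ticking (stop_machine holding off the timekeeping
             * cpu), so also bound the number of restarts.
             */
            if (time_before(jiffies, end) && !need_resched() &&
                --max_restart)
                    goto restart;

            wakeup_softirqd();
    }
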
alesaiko pushed a commit that referenced this issue Apr 22, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads, however, to be applied globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>
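
In the wakeup path this reduces to a small predicate consulted by
select_task_rq_fair(); a sketch of the shape used by the MSM trees:

    static inline int wake_to_idle(struct task_struct *p)
    {
            /* honour the flag on either the waker or the wakee */
            return (current->flags & PF_WAKE_UP_IDLE) ||
                   (p->flags & PF_WAKE_UP_IDLE);
    }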

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would keep the nohz idle balancer from
being triggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
which aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
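
The tail of idle_balance() then stops preserving a busy-time stamp when
nothing was pulled; a sketch of the intended logic (next_balance being the
value recomputed with idle intervals during the domain walk above):

    if (!pulled_task || time_after(jiffies, this_rq->next_balance)) {
            /*
             * We are going idle; next_balance may have been computed
             * with busy_factor applied, so take the idle value instead.
             */
            this_rq->next_balance = next_balance;
    }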

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

The SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total SCHED_FAIR tasks a cpu has across its various
cfs_rqs, rather than on the number of tasks in a particular cfs_rq (as
implemented currently). As a result, with the current code, it's possible
for a running task (which is the sole task in its cfs_rq) to be preempted
long after its ideal_runtime has elapsed, resulting in increased latency
for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
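
One plausible shape of the fix, at the spot where the fair class decides
whether to arm the tick (h_nr_running counts SCHED_FAIR tasks across all of
the cpu's cfs_rqs; a sketch, not the literal MSM diff):

    static void hrtick_update(struct rq *rq)
    {
            struct task_struct *curr = rq->curr;

            if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
                    return;

            /* arm based on every fair task on the cpu, not just curr's cfs_rq */
            if (rq->cfs.h_nr_running < sched_nr_latency)
                    hrtick_start_fair(rq, curr);
    }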

sched: Mark schedule_io_timeout() with EXPORT_SYMBOL

Make schedule_io_timeout() visible to modules.

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in the sched_slice() calculation by using the
wrong cfs_rq for tasks. rq->cfs was incorrectly used as a task's cfs_rq,
rather than the one the task actually belonged to.

Fix the bug by using the correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>
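
The upstream change makes scale_rt_power() read the racy fields exactly
once; condensed:

    static unsigned long scale_rt_power(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            u64 total, available, age_stamp, avg;

            /*
             * These are read without serialization; snapshot them once so
             * the validation and the subtraction see the same values.
             */
            age_stamp = ACCESS_ONCE(rq->age_stamp);
            avg = ACCESS_ONCE(rq->rt_avg);

            total = sched_avg_period() + (rq->clock - age_stamp);

            if (unlikely(total < avg)) {
                    /* ensures that power won't end up being negative */
                    available = 0;
            } else {
                    available = total - avg;
            }

            /* ... scale by total and return as before ... */
    }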

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it to go idle. However, that
patch relied on next_balance being calculated as a result of traversing
the cpu's sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it is
attached to a new sched domain hierarchy.  This should let cpus run
load balance checks at the time we expect them to!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A to CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is subsequently selected as the ILB, no reschedule IPI
will be sent because NOHZ_BALANCE_KICK is already set, and no Idle Load
Balance will be performed.

We must wait for the sched softirq to be raised on CPU 2 by another part
of the kernel before NOHZ_BALANCE_KICK gets cleared again.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f
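
The resulting helper, condensed from the upstream patch:

    static inline bool got_nohz_idle_kick(void)
    {
            int cpu = smp_processor_id();

            if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                    return false;

            if (idle_cpu(cpu) && !need_resched())
                    return true;

            /*
             * We can't run the Idle Load Balance on this cpu this time,
             * so cancel it and clear NOHZ_BALANCE_KICK.
             */
            clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
            return false;
    }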

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. A missing try_to_wake_up_local() can stall workqueue
execution, but such stalls are likely to be finite, either because
another work item gets queued or because the blocked one gets
unblocked.  There's no reason to trigger a BUG while holding the rq
lock and crashing the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because there was an earlier instance that occurred long ago,
change to WARN_ON(). If there ever is a flood of these, there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>
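
In code form the change is a single line in do_balance_runtime(); a sketch:

    static int do_balance_runtime(struct rt_rq *rt_rq)
    {
            struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
            /*
             * Use the root_domain of the rq this rt_rq belongs to, not
             * that of whichever cpu the timer callback happens to run on.
             */
            struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;

            /* ... borrow runtime from the other rt_rqs in rd->span ... */
    }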

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit the
BUG_ON at line 689 of rt.c during cpu hotplug while realtime threads are
running, because runtime gets enabled twice while the rt_runtime may have
already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with the task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
both source and target is one, take the opportunity to bail out and
return -ESRCH. This return condition can be further exploited to quickly
come out of the PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
 seeing source runqueue length = 1, but it had to export rq length).
 Peter came up with the elegant idea of return -ESRCH in scheduler core.

Signed-off-by: Peter Zijlstra <[email protected]>
[Raghavendra: added the check on the target vcpu's rq length. (thanks Avi)]
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92
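
The bail-out is an early check in yield_to(), taken with both rq locks
held; a sketch:

    /*
     * One runnable task on each side means neither rq can make progress
     * by yielding; tell the caller so it can leave the PLE handler fast.
     */
    if (rq->nr_running == 1 && p_rq->nr_running == 1) {
            yielded = -ESRCH;
            goto out_irq;   /* unlock both rqs and restore irqs */
    }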

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
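
The head of select_idle_sibling() then prefers the previous cpu before any
domain walk; condensed:

    static int select_idle_sibling(struct task_struct *p, int target)
    {
            int cpu = smp_processor_id();
            int prev_cpu = task_cpu(p);

            /* waking cpu is the target and idle: done */
            if (target == cpu && idle_cpu(cpu))
                    return cpu;

            /*
             * Previous cpu is cache affine and idle: take it and skip
             * the sd_llc walk that bounced buddy pairs across the package.
             */
            if (target == prev_cpu && idle_cpu(prev_cpu))
                    return prev_cpu;

            /* ... otherwise scan sd_llc for an idle cpu as before ... */
    }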

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
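
The corrected hook, condensed from the patch:

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /*
             * Test p->on_rq, not se->on_rq: dequeue_entity() clears
             * se->on_rq after it has already normalized vruntime, so
             * testing se->on_rq here would normalize a second time.
             */
            if (!p->on_rq && p->state != TASK_RUNNING) {
                    place_entity(cfs_rq, se, 0);
                    se->vruntime -= cfs_rq->min_vruntime;
            }
    }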

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to
trigger a divide by zero on the do_div(temp, (__force u32) total) line if
total is a non-zero number but has its lower 32 bits zeroed. Removing the
cast is not a good solution, since some do_div() implementations do cast
to u32 internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
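
The scaling helper picks the division primitive from the width of cputime_t
instead of blindly truncating total; a sketch of the fix:

    static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                                 cputime_t total)
    {
            u64 temp = (__force u64) rtime;

            temp *= (__force u64) utime;

            if (sizeof(cputime_t) == 4)
                    /* 32-bit cputime_t: the u32 divisor cannot be zero here */
                    temp = div_u64(temp, (__force u32) total);
            else
                    /* 64-bit cputime_t: divide by the full 64-bit value */
                    temp = div64_u64(temp, (__force u64) total);

            return (__force cputime_t) temp;
    }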

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock of
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp.
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of a realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
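
The readout half of the fix, condensed from kernel/sched/clock.c:

    static u64 sched_clock_remote(struct sched_clock_data *scd)
    {
            struct sched_clock_data *my_scd = this_scd();
            u64 this_clock, remote_clock;

    #if BITS_PER_LONG != 64
            /*
             * A 32-bit load of a 64-bit clock is not atomic; cmpxchg64()
             * with identical old and new values acts as an atomic 64-bit
             * read (it only writes if the clock happened to be 0).
             */
            this_clock = sched_clock_local(my_scd);
            remote_clock = cmpxchg64(&scd->clock, 0, 0);
    #else
            /* On 64 bit the readouts are plain atomic loads. */
            this_clock = sched_clock_local(my_scd);
            remote_clock = scd->clock;
    #endif
            /* ... pick the larger value and cmpxchg64() it in, as before ... */
    }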

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
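
The guard that plugs the leak, condensed (the dec side is symmetric):

    static void
    inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
    {
            struct rq *rq = rq_of_rt_rq(rt_rq);

    #ifdef CONFIG_RT_GROUP_SCHED
            /* change the rq's cpupri only for the top-level rt_rq */
            if (&rq->rt != rt_rq)
                    return;
    #endif
            if (rq->online && prio < prev_prio)
                    cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
    }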

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. (Let g[x] denote the se for "g" on cpu "x".)

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
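
The cure is a single initialization when the group se is set up; a sketch
of init_tg_cfs_entry():

    if (se) {
            /* guarantee group entities always have weight */
            update_load_set(&se->load, NICE_0_LOAD);
            se->parent = parent;
    }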

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
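
In code, the replenishment path simply overrides the span for the root
group; condensed:

    static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
    {
            const struct cpumask *span = sched_rt_period_mask();

    #ifdef CONFIG_RT_GROUP_SCHED
            /*
             * Isolated cpus sit outside every root_domain span but still
             * belong to the root task group, so the replenishment timer
             * must service them no matter which mask they are in.
             */
            if (rt_b == &root_task_group.rt_bandwidth)
                    span = cpu_online_mask;
    #endif
            /* ... walk span and replenish each cpu's rt_rq ... */
    }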

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick time by remembering the jiffies value of last updating the
timeout. As long as the RT task's jiffies is different with the global
jiffies value, we allow its timeout to be updated.

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevens futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson<[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust contest]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
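
In __sched_setscheduler() this boils down to one enqueue flag; a sketch
with 3.4-era names (a numerically larger prio is a lower priority from the
user space view):

    if (on_rq) {
        /* prio dropped: requeue at the head of the new prio bucket */
        enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);
    }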

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock the code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock, now that try_to_wake_up() takes another lock we
need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner, it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().
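
Sketched against the 3.4-era tree (architectures may override the default):

    /* include/linux/spinlock.h */
    #ifndef smp_mb__before_spinlock
    #define smp_mb__before_spinlock()   smp_wmb()
    #endif

    /* kernel/sched/core.c: __schedule() */
    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);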

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. Then it checks if a task should be pulled in the current
runqueue in idle_balance() assuming it will go to idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't go to idle but
we have filled the idle_stamp, thinking we will.

This patch closes the window by checking if the runqueue has been modified
but without pulling a task after taking the lock again, so we won't go to idle
right after in the __schedule() function.
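
A sketch of the recheck after re-taking the lock in idle_balance()
(3.4-era field names):

    raw_spin_lock(&this_rq->lock);

    /*
     * While the lock was dropped another cpu may have enqueued a task;
     * count that as a successful pull so __schedule() does not go idle
     * with a freshly filled idle_stamp.
     */
    if (this_rq->cfs.h_nr_running && !pulled_task)
        pulled_task = 1;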

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib, please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and it's using possible, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
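
The folding ends up as a small helper called from the hotplug path once
all tasks have been migrated away; roughly:

    static void calc_load_migrate(struct rq *rq)
    {
        long delta = calc_load_fold_active(rq);

        if (delta)
            atomic_long_add(delta, &calc_load_tasks);
    }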

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
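
The fix is a one-liner in sched_rr_get_interval(); roughly:

    /* was: time_slice = sched_slice(&rq->cfs, &p->se); */
    time_slice = sched_slice(cfs_rq_of(&p->se), &p->se);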

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.
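
A sketch of the extra pointer (3.4-era shape; it is updated in
sched_move_task() and friends under pi_lock and rq->lock):

    /* include/linux/sched.h: new task_struct member */
    struct task_group *sched_task_group;

    /* kernel/sched/sched.h: stable under either scheduler lock */
    static inline struct task_group *task_group(struct task_struct *p)
    {
        return p->sched_task_group;
    }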

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (i.e., restores the cpusets to exactly what they were before suspend, by
not touching them at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.
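
The resume-side logic ends up looking roughly like this (a sketch of the
cpuset_cpu_active() notifier; num_cpus_frozen counts the CPUs taken down
for suspend):

    static int cpuset_cpu_active(struct notifier_block *nfb,
                                 unsigned long action, void *hcpu)
    {
        switch (action) {
        case CPU_ONLINE_FROZEN:
        case CPU_DOWN_FAILED_FROZEN:
            /*
             * Until the last online operation of the resume sequence,
             * build a single sched domain and leave cpusets untouched.
             */
            num_cpus_frozen--;
            if (likely(num_cpus_frozen)) {
                partition_sched_domains(1, NULL, NULL);
                break;
            }
            /* last CPU online of the resume: fall through and rebuild
             * the sched domains from the (unaltered) cpusets */
        case CPU_ONLINE:
        case CPU_DOWN_FAILED:
            cpuset_update_active_cpus(true);
            break;
        default:
            return NOTIFY_DONE;
        }
        return NOTIFY_OK;
    }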

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it is the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer did funny I found that the
rq->cpu_load[] tables were completely screwy.. a bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
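
A sketch of the nohz-only catch-up path (3.4-era names):

    void update_idle_cpu_load(struct rq *this_rq)
    {
        unsigned long curr_jiffies = jiffies;
        unsigned long load = this_rq->load.weight;
        unsigned long pending_updates;

        /*
         * Only meaningful from nohz_idle_balance(); the tick path
         * updates unconditionally with pending_updates == 1.
         */
        if (load || curr_jiffies == this_rq->last_load_update_tick)
            return;

        pending_updates = curr_jiffies - this_rq->last_load_update_tick;
        this_rq->last_load_update_tick = curr_jiffies;

        __update_cpu_load(this_rq, load, pending_updates);
    }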

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online;
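
Concretely, the allocation part of init_rootdomain() becomes (sketch):

    if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
        goto out;
    if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
        goto free_span;
    if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
        goto free_online;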

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback to ksoftirqd condition to:
- A time limit of 2 ms,
- need_resched() being set on current task;

When one of these conditions is met, we wake up ksoftirqd for further softirq
processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.
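
The bail-out condition in __do_softirq() ends up roughly as follows
(sketch, eliding the dispatch loop):

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    asmlinkage void __do_softirq(void)
    {
        unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
        __u32 pending;

        /* ... dispatch pending softirq actions ... */

        pending = local_softirq_pending();
        if (pending) {
            if (time_before(jiffies, end) && !need_resched())
                goto restart;

            wakeup_softirqd();
        }
    }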

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following bench, 200 antagonist "netperf -t TCP_RR" are started in the
background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq() and
stop processing irqs after 10 restarts.
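
With the counter re-added the condition becomes (sketch; MAX_SOFTIRQ_RESTART
is 10):

    int max_restart = MAX_SOFTIRQ_RESTART;

    /* ... */

    pending = local_softirq_pending();
    if (pending) {
        if (time_before(jiffies, end) && !need_resched() &&
            --max_restart)
            goto restart;

        wakeup_softirqd();
    }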

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

alesaiko pushed a commit that referenced this issue Apr 24, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. The feature has too much of a
negative impact on other workloads however to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <[email protected]>

sched: add sysctl for controlling task migrations on wake

The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.
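
In the wakeup path this combines with the per-task flag roughly as follows
(a sketch; the names are taken from the msm-3.4 tree and may differ here):

    if (sysctl_sched_wake_to_idle ||
        (current->flags & PF_WAKE_UP_IDLE) ||
        (p->flags & PF_WAKE_UP_IDLE)) {
        /* prefer any idle cpu in the domain for this wakeup */
        for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
            if (idle_cpu(i))
                return i;
        }
    }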

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <[email protected]>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c

sched: Reset rq->next_interval before going idle

next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would not trigger the nohz idle balancer on
that cpu for an unnecessarily long time (introducing additional scheduling
latencies for tasks). Fix the bug in the scheduler which aims to reset
next_balance before a cpu goes idle (as per existing comment) but is
clearly not doing so.

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total SCHED_FAIR tasks a cpu has across its various cfs_rqs,
rather than on the number of tasks in a particular cfs_rq (as
implemented currently). As a result, with the current code, it's possible for
a running task (which is the sole task in its cfs_rq) to be preempted
much after its ideal_runtime has elapsed, resulting in increased latency
for tasks in other cfs_rq on same cpu.

Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.
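
A sketch of hrtick_start_fair() with the rq-wide count (3.4-era names):

    static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        /* was: if (cfs_rq->nr_running > 1) -- per-cfs_rq count */
        if (rq->cfs.h_nr_running > 1) {
            u64 slice = sched_slice(cfs_rq, se);
            u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
            s64 delta = slice - ran;

            if (delta < 0) {
                if (rq->curr == p)
                    resched_task(p);
                return;
            }
            hrtick_start(rq, delta);
        }
    }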

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Mark schedule_io_timeout() with EXPORT_SYMBOL

Make schedule_io_timeout() visible to modules.

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <[email protected]>

sched: fix reference to wrong cfs_rq

Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in sched_slice() calculation by using wrong
cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather
than the correct one to which they belonged.

Fix the bug by using correct cfs_rq for tasks.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Make sure to not re-read variables after validation

We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
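
The fix reads the racy fields once up front; roughly, in scale_rt_power():

    /*
     * Since we're reading these variables without serialization,
     * make sure we read them once and do no rereads.
     */
    age_stamp = ACCESS_ONCE(rq->age_stamp);
    avg = ACCESS_ONCE(rq->rt_avg);

    total = sched_avg_period() + (rq->clock - age_stamp);

    if (unlikely(total < avg)) {
        /* ensures that capacity won't end up being negative */
        available = 0;
    } else {
        available = total - avg;
    }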

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: David Rientjes <[email protected]>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678)

Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: re-calculate a cpu's next_balance point upon sched domain changes

Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it go idle. However that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it's
attaching to a new sched domain hierarchy.  This should let cpus call
load balance checks at the right time we expect them to!

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
(cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8)

Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f
Signed-off-by: Sridhar Nelloru <[email protected]>

sched: Unthrottle rt runqueues in __disable_runtime()

migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
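
The unthrottling in __disable_runtime() then ends with roughly:

    /*
     * Make the rt_rq schedulable again before its tasks are migrated
     * away; with infinite runtime, borrowing is moot.
     */
    rt_rq->rt_runtime = RUNTIME_INF;
    rt_rq->rt_throttled = 0;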

Signed-off-by: Peter Boonstoppel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix clear NOHZ_BALANCE_KICK

I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 2 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A to CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2 by another part
of the kernel before NOHZ_BALANCE_KICK is cleared.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
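
The resulting helper looks roughly like this:

    static inline bool got_nohz_idle_kick(void)
    {
        int cpu = smp_processor_id();

        if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
            return false;

        if (idle_cpu(cpu) && !need_resched())
            return true;

        /*
         * We can't run the Idle Load Balance on this CPU this time;
         * cancel it and clear NOHZ_BALANCE_KICK.
         */
        clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
        return false;
    }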

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: minor merge resolution for 3.4 in scheduler_ipi()]
Signed-off-by: Steve Muckle <[email protected]>
Change-Id: I3548612057cccc2ecc29429c129c44183083831f

sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s

try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked.  There's no reason to trigger BUG while holding rq
lock crashing the whole system.

Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
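
Sketch of the converted checks (bailing out instead of crashing):

    if (WARN_ON_ONCE(rq != this_rq()) ||
        WARN_ON_ONCE(p == current))
        return;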

Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373
CRs-Fixed: 519942
Signed-off-by: Steve Muckle <[email protected]>

sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()

The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit it indicates
something is wrong and that may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
un-noticed because there was an earlier instance that occurred long ago,
change to WARN_ON(). If there ever is a flood of these there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <[email protected]>

sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local()

try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched/rt: Use root_domain of rt_rq not current processor

When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask.  This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor.  Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.

This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try to reclaim runtime lent to another rt_rq but the runtime
has been lent to an rt_rq in another rd.

Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2
Signed-off-by: Shawn Bohrer <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Acked-by: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799
Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Syed Rameez Mustafa <[email protected]>

sched: Remove redundant update_runtime notifier

migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit the
BUG_ON at line 689 of rt.c when doing cpu hotplug while there are realtime
threads running, because runtime is enabled twice while the rt_runtime
may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolved trivial header file context conflict]
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix hotplug vs. set_cpus_allowed_ptr()

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael wang <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Toshi Kani <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <[email protected]>

sched: Fix race between try_to_wake_up() and move_task()

Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

sched: Bail out of yield_to when source and target runqueue has one task

In case of undercommitted scenarios, especially in large guests,
yield_to overhead is significantly high. When the run queue length of
source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
 seeing source runqueue length = 1, but it had to export rq length).
 Peter came up with the elegant idea of return -ESRCH in scheduler core.
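
Sketch of the bail-out inside yield_to(), after both runqueues are locked:

    /* a single-task source and target cannot benefit from yielding */
    if (rq->nr_running == 1 && p_rq->nr_running == 1) {
        yielded = -ESRCH;
        goto out_irq;
    }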

Signed-off-by: Peter Zijlstra <[email protected]>
Raghavendra, Checking the rq length of target vcpu condition added. (thanks Avi)
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Andrew Jones <[email protected]>
Tested-by: Chegu Vinod <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>

Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92

sched: Fix select_idle_sibling() bouncing cow syndrome

If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarantee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
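
A sketch of the corrected switched_from_fair() (3.4-era shape):

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        /*
         * dequeue_entity() has already cleared se->on_rq, so testing
         * !se->on_rq would normalize vruntime a second time; p->on_rq
         * is still accurate here.
         */
        if (!p->on_rq && p->state != TASK_RUNNING) {
            place_entity(cfs_rq, se, 0);
            se->vruntime -= cfs_rq->min_vruntime;
        }
    }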

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.

Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3
Signed-off-by: George McCollister <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times

On architectures where cputime_t is a 64 bit type, it is possible to trigger
a divide by zero on the do_div(temp, (__force u32) total) line, if total is a
non-zero number but has its lower 32 bits zeroed. Removing the casting is not
a good solution since some do_div() implementations do cast to u32
internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
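
The scaling helper the fix introduces looks roughly like this (3.4-era
backport shape):

    static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                                 cputime_t total)
    {
        u64 temp = (__force u64) rtime;

        temp *= (__force u64) utime;

        if (sizeof(cputime_t) == 4)
            temp = div_u64(temp, (__force u32) total);
        else
            temp = div64_u64(temp, (__force u64) total);

        return (__force cputime_t) temp;
    }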

Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.

CPU0			    CPU1

sched_clock_local()	    sched_clock_remote(CPU0)
...
			    remote_clock = scd[CPU0]->clock
			    read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
			    read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.

It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.

The resulting misbehaviour is, that CPU1 will see the sched_clock on
CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1's sched_clock happens in the context of a realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.

The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:

  <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
  <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
  <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_clock_remote(). The time jump gets propagated to CPU0 via
sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale
sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
   wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for
   one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump
   forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.

So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
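
The readout then becomes, roughly:

    #if BITS_PER_LONG != 64
        /*
         * Force an atomic 64bit readout: a plain load is two 32bit
         * reads here, and an update (or an NMI calling
         * sched_clock_local()) can hit between them.
         */
        this_clock = sched_clock_local(my_scd);
        remote_clock = cmpxchg64(&scd->clock, 0, 0);
    #else
        this_clock = sched_clock_local(my_scd);
        remote_clock = scd->clock;
    #endif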

Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.

Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d
Reported-by: Siegfried Wulsch <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all the rq's
RT tasks (no other task has the same priority) is waking on a
throttled rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.
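
Roughly, for inc_rt_prio_smp() (dec_rt_prio_smp() gets the same guard):

    static void
    inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
    {
        struct rq *rq = rq_of_rt_rq(rt_rq);

    #ifdef CONFIG_RT_GROUP_SCHED
        /* only the top-level rt_rq may change the rq-wide cpupri */
        if (&rq->rt != rt_rq)
            return;
    #endif
        if (rq->online && prio < prev_prio)
            cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
    }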

Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f
Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Chris J Arges <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Avoid updating RT entry timeout twice within one tick period

The issue below was found in 2.6.34-rt rather than mainline rt kernel, but
the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there is at
least one RT FIFO task per cpu.  The priority of these tasks is set to 49
by default.  If user launches an RT FIFO task with priority lower than 49
of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued
one cpu runqueue at one moment. By current strategy of balancing RT tasks,
when it comes to RT tasks, we really need to put them off to a CPU that
they can run on as soon as possible. Even if it means a bit of cache line
flushing, we want RT tasks to be run with the least latency.

When the user RT FIFO task which just launched before is running, the
sched timer tick of the current cpu happens. In this tick period, the
timeout value of the user RT task will be updated once. Subsequently,
we try to wake up one softirq RT task on its local cpu. As the priority
of current user RT task is lower than the softirq RT task, the current
task will be preempted by the higher priority softirq RT task.  Before
preemption, we check to see if current can readily move to a different
cpu. If so, we will reschedule to allow the RT push logic to try to move
current somewhere else. Whenever the woken softirq RT task runs, it first
tries to migrate the user FIFO RT task over to a cpu that is running a
task of lesser priority. If migration is done, it will send a reschedule
request to the found cpu by IPI interrupt. Once the target cpu responds
the IPI interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu, the sched
timer tick of the cpu fires. So it will tick the user RT task again. This
also means the RT task timeout value will be updated again.  As the
migration may be done in one tick period, it means the user RT task
timeout value will be updated twice within one tick.

If we set a limit on the amount of cpu time for the user RT task by
setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon
reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the RT task
timeout value. In fact the timeout mechanism of sending the SIGXCPU signal
assumes the RT task timeout is increased once every tick.

However, currently the timeout value may be added twice per tick. So it
results in the SIGXCPU signal being sent earlier than expected.

To solve this issue, we prevent the timeout value from increasing twice
within one tick time by remembering the jiffies value of last updating the
timeout. As long as the RT task's jiffies is different with the global
jiffies value, we allow its timeout to be updated.

Signed-off-by: Ying Xue <[email protected]>
Signed-off-by: Fan Du <[email protected]>
Reviewed-by: Yong Zhang <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target

This patch adds checks that prevens futile attempts to move rt tasks to a
CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of a
well known OLTP benchmark by 0.7%.

Change-Id: I232abe50623593d51be508fc64e8b865dcc77212
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Shawn Bohrer <[email protected]>
Cc: Suruchi Kadu <[email protected]>
Cc: Doug Nelson<[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.

Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13
Signed-off-by: Brian Silverman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
[lizf: Backported to 3.4: adjust contest]
Signed-off-by: Zefan Li <[email protected]>

sched: Queue RT tasks to head when prio drops

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48c (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.

Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
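
Schematically, the requeue in __sched_setscheduler() then becomes
(a sketch; recall that a numerically larger kernel ->prio means a
lower priority):

    if (on_rq)
        /*
         * Queue to the head of the prio bucket when the priority
         * dropped, so the task keeps the CPU per the POSIX
         * SCHED_FIFO guarantee.
         */
        enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);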

sched: Fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed to check
p->state under p->pi_lock, code like:

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal.  This is the special case of wait-for-condition, it
relies on try_to_wake_up/schedule interaction and thus it does not need
mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical section
protected by rq->lock; now that try_to_wake_up() takes another lock, we
need to ensure that it can't be reordered with the
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock).  This is what try_to_wake_up() already
does, for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(), for
better documentation and to allow the architectures to change the default
implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
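
The helper and its use then look roughly like this (default
implementation; architectures may provide their own):

    /* include/linux/spinlock.h */
    #ifndef smp_mb__before_spinlock
    #define smp_mb__before_spinlock()   smp_wmb()
    #endif

    /* kernel/sched/core.c, __schedule() */
    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);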

sched: Fix race in idle_balance()

The scheduler's main function, schedule(), checks if there are no more tasks
on the runqueue. Then it checks in idle_balance() whether a task should be
pulled into the current runqueue, assuming it will go idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task on our runqueue, so we won't go idle even
though we have filled in idle_stamp, thinking we will.

This patch closes the window by checking, after taking the lock again,
whether the runqueue has been modified without a task having been pulled;
in that case we won't go idle right after in the __schedule() function.

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
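
A sketch of the re-check once idle_balance() retakes the lock
(assuming the CFS h_nr_running counter of that era):

    raw_spin_lock(&this_rq->lock);

    /*
     * While the lock was dropped, another cpu may have enqueued a
     * task; pretend we pulled one so __schedule() won't go idle.
     */
    if (this_rq->cfs.h_nr_running && !pulled_task)
        pulled_task = 1;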

sched: Fix load avg vs cpu-hotplug

Rakib and Paul reported two different issues related to the same few
lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rakib, please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug; we would all like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops. However, since nr_uninterruptible() is the
only such loop and it's using possible, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw up rq->calc_load_active for both
rqs involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
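
The fold can look something like this (a sketch, not the verbatim
patch):

    static void calc_load_migrate(struct rq *rq)
    {
        /*
         * Fold whatever nr_active delta is left on the dead cpu
         * into the global count instead of migrating
         * nr_uninterruptible.
         */
        long delta = calc_load_fold_active(rq);

        if (delta)
            atomic_long_add(delta, &calc_load_tasks);
    }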

sched: Fix load avg vs. cpu-hotplug

Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:

In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.

Also note that we can call calc_load_migrate() without serialization: we
know the state of the rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.

Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
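
In the migration notifier this amounts to moving the call, roughly:

    case CPU_DEAD:
        /*
         * nr_running is 0 here -- even the stopper thread is gone --
         * so the fold sees the final, stable task counts.
         */
        calc_load_migrate(rq);
        break;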

sched/nohz: Rewrite and fix load-avg computation -- again

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7
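
For context, the fixed-point average that all of this feeds is computed
by the classic helper, roughly:

    static unsigned long
    calc_load(unsigned long load, unsigned long exp, unsigned long active)
    {
        /* load = load * exp + active * (1 - exp), in FSHIFT fixed point */
        load *= exp;
        load += active * (FIXED_1 - exp);
        return load >> FSHIFT;
    }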

sched: Fix the broken sched_rr_get_interval()

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.

Signed-off-by: Zhu Yanhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
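
The corrected call site reads, schematically:

    static unsigned int
    get_rr_interval_fair(struct rq *rq, struct task_struct *task)
    {
        struct sched_entity *se = &task->se;
        unsigned int rr_interval = 0;

        /*
         * Return a non-zero timeslice only on a non-idle runqueue;
         * sched_slice() wants the se's own cfs_rq, not rq->cfs.
         */
        if (rq->cfs.load.weight)
            rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));

        return rr_interval;
    }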

sched: Fix race in task_group()

Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call
task_group() too many times in set_task_rq()").  He found the reason to be
that the multiple task_group() invocations in set_task_rq() returned
different values.

Looking at all that I found a lack of serialization and plain wrong
comments.

The below tries to fix it using an extra pointer which is updated under
the appropriate scheduler locks. It's not pretty, but I can't really see
another way given how all the cgroup stuff works.

Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39
Reported-and-tested-by: Stefan Bader <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
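
A sketch of the extra pointer: cache the group in the task under the
scheduler locks and have task_group() return the cached value:

    /* in struct task_struct (sketch) */
    struct task_group *sched_task_group;

    /* every invocation now returns the same, serialized, value */
    static inline struct task_group *task_group(struct task_struct *p)
    {
        return p->sched_task_group;
    }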

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend/
resume (i.e., it restores the cpusets to exactly what they were before
suspend, by not touching them at all) but also speeds up suspend/resume,
because we avoid running the cpuset update code for every CPU being
offlined/onlined.

Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
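
One way to wire this up in the hotplug callbacks is a frozen-CPU
counter; a rough sketch, with names and structure illustrative only:

    static int num_cpus_frozen;    /* CPUs taken down by suspend */

    static int cpuset_cpu_active(struct notifier_block *nfb,
                                 unsigned long action, void *hcpu)
    {
        switch (action) {
        case CPU_ONLINE_FROZEN:
            /*
             * Resume path: keep a single domain until the last CPU
             * is back, then rebuild from the untouched cpusets.
             */
            num_cpus_frozen--;
            if (num_cpus_frozen) {
                partition_sched_domains(1, NULL, NULL);
                break;
            }
            /* fall through: last CPU of the resume sequence */
        case CPU_ONLINE:
        case CPU_DOWN_FAILED:
            cpuset_update_active_cpus();
            break;
        default:
            return NOTIFY_DONE;
        }
        return NOTIFY_OK;
    }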

sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq point to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to a "use-after-free" and cause a panic,
because it is the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
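
The fix boils down to refreshing the child's se.parent,cfs_rq in
task_fork_fair() once the rq lock is held, e.g. (sketch):

    /*
     * The parent may have been moved to another task_group after
     * dup_task_struct(); re-derive se.parent,cfs_rq for the child.
     */
    rcu_read_lock();
    __set_task_cpu(p, this_cpu);
    rcu_read_unlock();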

sched/nohz: Fix rq->cpu_load[] calculations

While investigating why the load-balancer was doing funny things I found
that the rq->cpu_load[] tables were completely screwy. A bit more digging
revealed that the updates that got through were missing ticks, followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle, etc.). This means that
especially the higher indices were significantly lower than they ought
to be.

The reason for this is that it's not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.

Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
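
The idle-side catch-up then lives in its own helper, called from
nohz_idle_balance(); a sketch of the 3.4-era code:

    void update_idle_cpu_load(struct rq *this_rq)
    {
        unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
        unsigned long load = this_rq->load.weight;
        unsigned long pending_updates;

        /* Only catch up when truly idle and at least one tick behind. */
        if (load || curr_jiffies == this_rq->last_load_update_tick)
            return;

        pending_updates = curr_jiffies - this_rq->last_load_update_tick;
        this_rq->last_load_update_tick = curr_jiffies;

        __update_cpu_load(this_rq, load, pending_updates);
    }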

sched/nohz: Fix rq->cpu_load calculations some more

commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream.

Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try to catch up from
nohz idle balance and nohz exit. Both of these are still prone to missing
a jiffy, but that has always been the case.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Venkatesh Pallipadi <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
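
The nohz-exit callback ages the array with zero load for the missed
ticks; roughly:

    void update_cpu_load_nohz(void)
    {
        struct rq *this_rq = this_rq();
        unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
        unsigned long pending_updates;

        if (curr_jiffies == this_rq->last_load_update_tick)
            return;

        raw_spin_lock(&this_rq->lock);
        pending_updates = curr_jiffies - this_rq->last_load_update_tick;
        if (pending_updates) {
            this_rq->last_load_update_tick = curr_jiffies;
            /* We were idle: decay with zero load for the missed ticks. */
            __update_cpu_load(this_rq, 0, pending_updates);
        }
        raw_spin_unlock(&this_rq->lock);
    }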

sched/core: Clear the root_domain cpumasks in init_rootdomain()

root_domain::rto_mask allocated through alloc_cpumask_var() contains
garbage data, and this may cause problems.  For instance, when doing
pull_rt_task(), it may do useless iterations if rto_mask retains some
extra garbage bits. Worse still, this violates the isolated domain rule
for clustered scheduling using cpuset, because tasks (with all the
cpus allowed) belonging to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var() instead of
alloc_cpumask_var() for root_domain::rto_mask allocation, thereby
addressing the issues.

Do the same thing for root_domain's other cpumask members:
- span,
- online.

Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
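
The change itself is mechanical; in init_rootdomain(), schematically:

    if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
        goto out;
    if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
        goto free_span;
    if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
        goto free_online;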

softirq: Reduce latencies

In various network workloads, __do_softirq() latencies can be up to 20 ms
if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher, and some
actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:
- A time limit of 2 ms, or
- need_resched() being set on the current task.

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls, as we
can keep BH disabled for too long.

I ran several benchmarks and got no significant difference in throughput,
but a very significant reduction of latencies (one order of magnitude):

In the following bench, 200 antagonist "netperf -t TCP_RR" instances are
started in the background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard + soft).

Before patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch:

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Signed-off-by: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Cc: Tom Herbert <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
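
The bail-out at the end of a __do_softirq() pass then looks roughly
like:

    #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

    pending = local_softirq_pending();
    if (pending) {
        if (time_before(jiffies, end) && !need_resched())
            goto restart;
        /* Over budget or preemption wanted: hand off to ksoftirqd. */
        wakeup_softirqd();
    }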

softirq: Fix lockup related to stop_machine being stuck in __do_softirq.

The stop machine logic can lock up if all but one of the migration threads
make it through the disable-irq step and the one remaining thread gets
stuck in __do_softirq. The reason __do_softirq can hang is that it has a
bail-out based on jiffies timeout, but in the lockup case, jiffies itself
is not incremented.

To work around this, re-add the max_restart counter in __do_softirq and
stop processing softirqs after 10 restarts.

Thanks to Tejun Heo and Rusty Russell and others for helping me track this
down.

This was introduced in 3.9 by commit c10d73671ad3
("softirq: reduce latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7
Signed-off-by: Ben Greear <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Pekka Riikonen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
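
With the counter re-added, the exit condition becomes, schematically:

    #define MAX_SOFTIRQ_RESTART 10

    pending = local_softirq_pending();
    if (pending) {
        /* jiffies may be frozen here, so also cap the restarts. */
        if (time_before(jiffies, end) && !need_resched() &&
            --max_restart)
            goto restart;
        wakeup_softirqd();
    }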
alesaiko pushed a commit that referenced this issue Sep 30, 2018
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls.  If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.

 # trace-cmd record -e raw_syscalls:* true && trace-cmd report
 ...
 true-653   [000]   384.675777: sys_enter:            NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
 true-653   [000]   384.675812: sys_exit:             NR 192 = 1995915264
 true-653   [000]   384.675971: sys_enter:            NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
 true-653   [000]   384.675988: sys_exit:             NR 983045 = 0
 ...

 # trace-cmd record -e syscalls:* true
 [   17.289329] Unable to handle kernel paging request at virtual address aaaaaace
 [   17.289590] pgd = 9e71c000
 [   17.289696] [aaaaaace] *pgd=00000000
 [   17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
 [   17.290169] Modules linked in:
 [   17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
 [   17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
 [   17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
 [   17.290866] LR is at syscall_trace_enter+0x124/0x184

Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

Commit cd0980f "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for numbers greater than or equal to NR_syscalls.

Link: http://lkml.kernel.org/p/[email protected]

Fixes: cd0980f "tracing: Check invalid syscall nr while tracing syscalls"
Cc: [email protected] # 2.6.33+
Signed-off-by: Rabin Vincent <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

Change-Id: I512142f8f1e1b2a8dc063209666dbce9737377e7
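
The guard in the ftrace/perf syscall handlers becomes a plain bounds
check (sketch):

    int syscall_nr = syscall_get_nr(current, regs);

    /*
     * ARM private syscalls (e.g. NR 983045 above) live outside the
     * table; ignore anything out of range instead of indexing it.
     */
    if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
        return;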
alesaiko pushed a commit that referenced this issue Sep 30, 2018
If a user key gets negatively instantiated, an error code is cached in the
payload area.  A negatively instantiated key may then be positively
instantiated by updating it with valid data.  However, the ->update key
type method must be aware that the error code may be there.

The following may be used to trigger the bug in the user key type:

    keyctl request2 user user "" @U
    keyctl add user user "a" @U

which manifests itself as:

	BUG: unable to handle kernel paging request at 00000000ffffff8a
	IP: [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046
	PGD 7cc30067 PUD 0
	Oops: 0002 [#1] SMP
	Modules linked in:
	CPU: 3 PID: 2644 Comm: a.out Not tainted 4.3.0+ #49
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
	task: ffff88003ddea700 ti: ffff88003dd88000 task.ti: ffff88003dd88000
	RIP: 0010:[<ffffffff810a376f>]  [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280
	 [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046
	RSP: 0018:ffff88003dd8bdb0  EFLAGS: 00010246
	RAX: 00000000ffffff82 RBX: 0000000000000000 RCX: 0000000000000001
	RDX: ffffffff81e3fe40 RSI: 0000000000000000 RDI: 00000000ffffff82
	RBP: ffff88003dd8bde0 R08: ffff88007d2d2da0 R09: 0000000000000000
	R10: 0000000000000000 R11: ffff88003e8073c0 R12: 00000000ffffff82
	R13: ffff88003dd8be68 R14: ffff88007d027600 R15: ffff88003ddea700
	FS:  0000000000b92880(0063) GS:ffff88007fd00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	CR2: 00000000ffffff8a CR3: 000000007cc5f000 CR4: 00000000000006e0
	Stack:
	 ffff88003dd8bdf0 ffffffff81160a8a 0000000000000000 00000000ffffff82
	 ffff88003dd8be68 ffff88007d027600 ffff88003dd8bdf0 ffffffff810a39e5
	 ffff88003dd8be20 ffffffff812a31ab ffff88007d027600 ffff88007d027620
	Call Trace:
	 [<ffffffff810a39e5>] kfree_call_rcu+0x15/0x20 kernel/rcu/tree.c:3136
	 [<ffffffff812a31ab>] user_update+0x8b/0xb0 security/keys/user_defined.c:129
	 [<     inline     >] __key_update security/keys/key.c:730
	 [<ffffffff8129e5c1>] key_create_or_update+0x291/0x440 security/keys/key.c:908
	 [<     inline     >] SYSC_add_key security/keys/keyctl.c:125
	 [<ffffffff8129fc21>] SyS_add_key+0x101/0x1e0 security/keys/keyctl.c:60
	 [<ffffffff8185f617>] entry_SYSCALL_64_fastpath+0x12/0x6a arch/x86/entry/entry_64.S:185

Note the error code (-ENOKEY) in EDX.

A similar bug can be tripped by:

    keyctl request2 trusted user "" @U
    keyctl add trusted user "a" @U

This should also affect encrypted keys - but that has to be correctly
parameterised or it will fail with EINVAL before getting to the bit that
crashes.

Change-Id: I171d566f431c56208e1fe279f466d2d399a9ac7c
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: David Howells <[email protected]>
Acked-by: Mimi Zohar <[email protected]>
Signed-off-by: James Morris <[email protected]>
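
For the user key type's ->update this means only RCU-freeing the old
payload when the key is not negatively instantiated; a sketch:

    /*
     * When the key was negatively instantiated the "payload" is an
     * error code, not a pointer -- don't try to free it.
     */
    zap = NULL;
    if (!test_bit(KEY_FLAG_NEGATIVE, &key->flags))
        zap = key->payload.data;
    rcu_assign_keypointer(key, upayload);
    key->expiry = 0;
    if (zap)
        kfree_rcu(zap, rcu);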
alesaiko pushed a commit that referenced this issue Sep 30, 2018
An issue was observed when a userspace task exits.
The page which hits the error here is the zero page.
In binder mmap, the whole of the vma is not mapped.
On a task crash, when debuggerd reads the binder regions,
the unmapped areas fall to do_anonymous_page in handle_pte_fault,
due to the absence of a vm_fault handler. This results in
the zero page being mapped. Later in zap_pte_range, vm_normal_page
returns the zero page in the case of VM_MIXEDMAP, and this results in
the error below.

BUG: Bad page map in process mediaserver  pte:9dff379f pmd:9bfbd831
page:c0ed8e60 count:1 mapcount:-1 mapping:  (null) index:0x0
page flags: 0x404(referenced|reserved)
addr:40c3f000 vm_flags:10220051 anon_vma:  (null) mapping:d9fe0764 index:fd
vma->vm_ops->fault:   (null)
vma->vm_file->f_op->mmap: binder_mmap+0x0/0x274
CPU: 0 PID: 1463 Comm: mediaserver Tainted: G        W    3.10.17+ #1
[<c001549c>] (unwind_backtrace+0x0/0x11c) from [<c001200c>] (show_stack+0x10/0x14)
[<c001200c>] (show_stack+0x10/0x14) from [<c0103d78>] (print_bad_pte+0x158/0x190)
[<c0103d78>] (print_bad_pte+0x158/0x190) from [<c01055f0>] (unmap_single_vma+0x2e4/0x598)
[<c01055f0>] (unmap_single_vma+0x2e4/0x598) from [<c010618c>] (unmap_vmas+0x34/0x50)
[<c010618c>] (unmap_vmas+0x34/0x50) from [<c010a9e4>] (exit_mmap+0xc8/0x1e8)
[<c010a9e4>] (exit_mmap+0xc8/0x1e8) from [<c00520f0>] (mmput+0x54/0xd0)
[<c00520f0>] (mmput+0x54/0xd0) from [<c005972c>] (do_exit+0x360/0x990)
[<c005972c>] (do_exit+0x360/0x990) from [<c0059ef0>] (do_group_exit+0x84/0xc0)
[<c0059ef0>] (do_group_exit+0x84/0xc0) from [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548)
[<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) from [<c0011500>] (do_signal+0xa8/0x3b8)

Add a vm_fault handler which returns VM_FAULT_SIGBUS, and prevents the
wrong fallback to do_anonymous_page.

Signed-off-by: Vinayak Menon <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
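
The handler itself is trivial (sketch):

    static int binder_vm_fault(struct vm_area_struct *vma,
                               struct vm_fault *vmf)
    {
        /* Never fall back to do_anonymous_page() for unmapped areas. */
        return VM_FAULT_SIGBUS;
    }

    static struct vm_operations_struct binder_vm_ops = {
        .open = binder_vma_open,
        .close = binder_vma_close,
        .fault = binder_vm_fault,
    };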
alesaiko pushed a commit that referenced this issue Sep 30, 2018
Fix a short sprintf buffer in proc_keys_show().  If the gcc stack protector
is turned on, this can cause a panic due to stack corruption.

The problem is that xbuf[] is not big enough to hold a 64-bit timeout
rendered as weeks:

	(gdb) p 0xffffffffffffffffULL/(60*60*24*7)
	$2 = 30500568904943

That's 14 chars plus NUL, not 11 chars plus NUL.

Expand the buffer to 16 chars.

I think the unpatched code apparently works if the stack protector is not
enabled, because on a 32-bit machine the buffer won't be overflowed, and on
a 64-bit machine there's a 64-bit aligned pointer on one side and an int
that isn't checked again on the other side.

The panic incurred looks something like:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81352ebe
CPU: 0 PID: 1692 Comm: reproducer Not tainted 4.7.2-201.fc24.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 0000000000000086 00000000fbbd2679 ffff8800a044bc00 ffffffff813d941f
 ffffffff81a28d58 ffff8800a044bc98 ffff8800a044bc88 ffffffff811b2cb6
 ffff880000000010 ffff8800a044bc98 ffff8800a044bc30 00000000fbbd2679
Call Trace:
 [<ffffffff813d941f>] dump_stack+0x63/0x84
 [<ffffffff811b2cb6>] panic+0xde/0x22a
 [<ffffffff81352ebe>] ? proc_keys_show+0x3ce/0x3d0
 [<ffffffff8109f7f9>] __stack_chk_fail+0x19/0x30
 [<ffffffff81352ebe>] proc_keys_show+0x3ce/0x3d0
 [<ffffffff81350410>] ? key_validate+0x50/0x50
 [<ffffffff8134db30>] ? key_default_cmp+0x20/0x20
 [<ffffffff8126b31c>] seq_read+0x2cc/0x390
 [<ffffffff812b6b12>] proc_reg_read+0x42/0x70
 [<ffffffff81244fc7>] __vfs_read+0x37/0x150
 [<ffffffff81357020>] ? security_file_permission+0xa0/0xc0
 [<ffffffff81246156>] vfs_read+0x96/0x130
 [<ffffffff81247635>] SyS_read+0x55/0xc0
 [<ffffffff817eb872>] entry_SYSCALL_64_fastpath+0x1a/0xa4

Change-Id: I0787d5a38c730ecb75d3c08f28f0ab36295d59e7
Reported-by: Ondrej Kozina <[email protected]>
Signed-off-by: David Howells <[email protected]>
Tested-by: Ondrej Kozina <[email protected]>
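
The fix is the buffer declaration alone (sketch):

    char xbuf[16];    /* 14 chars plus NUL for a 64-bit timeout in weeks,
                       * rounded up */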
alesaiko pushed a commit that referenced this issue Sep 30, 2018
There is a race between the TX path and the STA wakeup: while
a station is sleeping, mac80211 buffers frames until it wakes
up, then the frames are transmitted. However, the RX and TX
paths are concurrent, so the packet indicating wakeup can be
processed while a packet is being transmitted.

This can lead to a situation where the buffered frames list
is emptied on the one side, while a frame is being added on
the other side, as the station is still seen as sleeping in
the TX path.

As a result, the newly added frame will not be sent anytime
soon. It might be sent much later (and out of order) when the
station goes to sleep and wakes up the next time.

Additionally, it can lead to the crash below.

Fix all this by synchronising both paths with a new lock.
Neither path is a fastpath, since both handle PS situations.

In a later patch we'll remove the extra skb queue locks to
reduce locking overhead.

BUG: unable to handle kernel
NULL pointer dereference at 000000b0
IP: [<ff6f1791>] ieee80211_report_used_skb+0x11/0x3e0 [mac80211]
*pde = 00000000
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
EIP: 0060:[<ff6f1791>] EFLAGS: 00210282 CPU: 1
EIP is at ieee80211_report_used_skb+0x11/0x3e0 [mac80211]
EAX: e5900da0 EBX: 00000000 ECX: 00000001 EDX: 00000000
ESI: e41d00c0 EDI: e5900da0 EBP: ebe458e4 ESP: ebe458b0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 000000b0 CR3: 25a78000 CR4: 000407d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process iperf (pid: 3934, ti=ebe44000 task=e757c0b0 task.ti=ebe44000)
iwlwifi 0000:02:00.0: I iwl_pcie_enqueue_hcmd Sending command LQ_CMD (#4e), seq: 0x0903, 92 bytes at 3[3]:9
Stack:
 e403b32c ebe458c4 00200002 00200286 e403b338 ebe458cc c10960bb e5900da0
 ff76a6ec ebe458d8 00000000 e41d00c0 e5900da0 ebe458f0 ff6f1b75 e403b210
 ebe4598c ff723dc1 00000000 ff76a6ec e597c978 e403b758 00000002 00000002
Call Trace:
 [<ff6f1b75>] ieee80211_free_txskb+0x15/0x20 [mac80211]
 [<ff723dc1>] invoke_tx_handlers+0x1661/0x1780 [mac80211]
 [<ff7248a5>] ieee80211_tx+0x75/0x100 [mac80211]
 [<ff7249bf>] ieee80211_xmit+0x8f/0xc0 [mac80211]
 [<ff72550e>] ieee80211_subif_start_xmit+0x4fe/0xe20 [mac80211]
 [<c149ef70>] dev_hard_start_xmit+0x450/0x950
 [<c14b9aa9>] sch_direct_xmit+0xa9/0x250
 [<c14b9c9b>] __qdisc_run+0x4b/0x150
 [<c149f732>] dev_queue_xmit+0x2c2/0xca0

Cc: [email protected]
Reported-by: Yaara Rozenblum <[email protected]>
Signed-off-by: Emmanuel Grumbach <[email protected]>
Reviewed-by: Stanislaw Gruszka <[email protected]>
[reword commit log, use a separate lock]
Signed-off-by: Johannes Berg <[email protected]>

CVE-2014-2706

Change-Id: I555e82d71add385f300ecc20af5d5d8b69246c52
(cherry picked from commit 1d147bfa64293b2723c4fec50922168658e613ba)
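
Schematically, both paths bracket their critical sections with the new
per-STA lock (a sketch; field and flag names as in mac80211 of that
era):

    /* TX path, before queueing onto the PS buffer */
    spin_lock(&sta->ps_lock);
    if (test_sta_flag(sta, WLAN_STA_PS_STA))
        skb_queue_tail(&sta->ps_tx_buf[ac], tx->skb);
    spin_unlock(&sta->ps_lock);

    /*
     * The wakeup path takes sta->ps_lock too while it drains
     * ps_tx_buf, so a frame can no longer be added concurrently.
     */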
alesaiko pushed a commit that referenced this issue Sep 30, 2018
This fixes CVE-2016-8650.

If mpi_powm() is given a zero exponent, it wants to immediately return
either 1 or 0, depending on the modulus.  However, if the result was
initialised with zero limb space, no limb space is allocated and a
NULL-pointer exception ensues.

Fix this by allocating a minimal amount of limb space for the result in
the 0-exponent case when the result is 1, and by not touching the limb
space when the result is 0.

This affects the use of RSA keys and X.509 certificates that carry them.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
PGD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 3 PID: 3014 Comm: keyctl Not tainted 4.9.0-rc6-fscache+ #278
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
task: ffff8804011944c0 task.stack: ffff880401294000
RIP: 0010:[<ffffffff8138ce5d>]  [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
RSP: 0018:ffff880401297ad8  EFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff88040868bec0 RCX: ffff88040868bba0
RDX: ffff88040868b260 RSI: ffff88040868bec0 RDI: ffff88040868bee0
RBP: ffff880401297ba8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000047 R11: ffffffff8183b210 R12: 0000000000000000
R13: ffff8804087c7600 R14: 000000000000001f R15: ffff880401297c50
FS:  00007f7a7918c700(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000401250000 CR4: 00000000001406e0
Stack:
 ffff88040868bec0 0000000000000020 ffff880401297b00 ffffffff81376cd4
 0000000000000100 ffff880401297b10 ffffffff81376d12 ffff880401297b30
 ffffffff81376f37 0000000000000100 0000000000000000 ffff880401297ba8
Call Trace:
 [<ffffffff81376cd4>] ? __sg_page_iter_next+0x43/0x66
 [<ffffffff81376d12>] ? sg_miter_get_next_page+0x1b/0x5d
 [<ffffffff81376f37>] ? sg_miter_next+0x17/0xbd
 [<ffffffff8138ba3a>] ? mpi_read_raw_from_sgl+0xf2/0x146
 [<ffffffff8132a95c>] rsa_verify+0x9d/0xee
 [<ffffffff8132acca>] ? pkcs1pad_sg_set_buf+0x2e/0xbb
 [<ffffffff8132af40>] pkcs1pad_verify+0xc0/0xe1
 [<ffffffff8133cb5e>] public_key_verify_signature+0x1b0/0x228
 [<ffffffff8133d974>] x509_check_for_self_signed+0xa1/0xc4
 [<ffffffff8133cdde>] x509_cert_parse+0x167/0x1a1
 [<ffffffff8133d609>] x509_key_preparse+0x21/0x1a1
 [<ffffffff8133c3d7>] asymmetric_key_preparse+0x34/0x61
 [<ffffffff812fc9f3>] key_create_or_update+0x145/0x399
 [<ffffffff812fe227>] SyS_add_key+0x154/0x19e
 [<ffffffff81001c2b>] do_syscall_64+0x80/0x191
 [<ffffffff816825e4>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 56 41 55 41 54 53 48 81 ec a8 00 00 00 44 8b 71 04 8b 42 04 4c 8b 67 18 45 85 f6 89 45 80 0f 84 b4 06 00 00 85 c0 75 2f 41 ff ce <49> c7 04 24 01 00 00 00 b0 01 75 0b 48 8b 41 18 48 83 38 01 0f
RIP  [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
 RSP <ffff880401297ad8>
CR2: 0000000000000000
---[ end trace d82015255d4a5d8d ]---

Basically, this is a backport of a libgcrypt patch:

	http://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=patch;h=6e1adb05d290aeeb1c230c763970695f4a538526

Fixes: cdec9cb ("crypto: GnuPG based MPI lib - source files (part 1)")
Change-Id: Ib4ad146d7488281260dade0d8d1a4f54fa7bd6ae
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: Dmitry Kasatkin <[email protected]>
cc: [email protected]
cc: [email protected]
Signed-off-by: James Morris <[email protected]>
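
The zero-exponent early return, with the minimal allocation, looks
roughly like:

    if (!esize) {
        /*
         * x^0 mod m is 1, or 0 when m == 1; only the "1" case needs
         * a limb actually allocated.
         */
        res->nlimbs = (msize == 1 && mod->d[0] == 1) ? 0 : 1;
        if (res->nlimbs) {
            if (mpi_resize(res, 1) < 0)
                goto enomem;
            rp = res->d;
            rp[0] = 1;
        }
        res->sign = 0;
        goto leave;
    }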
alesaiko pushed a commit that referenced this issue Sep 30, 2018
The lock class of the 4 n_hdlc buf list locks is the same because a single
function, n_hdlc_buf_list_init, is used to init all the locks. But since
flush_tx_queue takes n_hdlc->tx_buf_list.spinlock and then calls
n_hdlc_buf_put which takes n_hdlc->tx_free_buf_list.spinlock, lockdep
emits a warning:
=============================================
[ INFO: possible recursive locking detected ]
4.3.0-25.g91e30a7-default #1 Not tainted
---------------------------------------------
a.out/1248 is trying to acquire lock:
 (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc]

but task is already holding lock:
 (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc]

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&list->spinlock)->rlock);
  lock(&(&list->spinlock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by a.out/1248:
 #0:  (&tty->ldisc_sem){++++++}, at: [<ffffffff814c9eb0>] tty_ldisc_ref_wait+0x20/0x50
 #1:  (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc]
...
Call Trace:
...
 [<ffffffff81738fd0>] _raw_spin_lock_irqsave+0x50/0x70
 [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc]
 [<ffffffffa01fdc24>] n_hdlc_tty_ioctl+0x144/0x1d0 [n_hdlc]
 [<ffffffff814c25c1>] tty_ioctl+0x3f1/0xe40
...

Fix it by initializing the spin_locks separately. This also removes a
redundant memset of freshly kzallocated space.

Change-Id: I32bc83c9e19953672857fe8182107772411d471a
Signed-off-by: Jiri Slaby <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
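
Initializing each list's lock at its own call site gives every lock a
distinct lockdep class (sketch of the n_hdlc_alloc() change):

    spin_lock_init(&n_hdlc->rx_free_buf_list.spinlock);
    spin_lock_init(&n_hdlc->tx_free_buf_list.spinlock);
    spin_lock_init(&n_hdlc->rx_buf_list.spinlock);
    spin_lock_init(&n_hdlc->tx_buf_list.spinlock);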