Error 7 #1
Comments
This is firmware related; the kernel itself has nothing to do with it.
alesaiko added a commit that referenced this issue on Jan 28, 2018
Revert "binder: Quiet binder" This reverts commit 390b8ce. Change-Id: I31dcf564946247d329bf14dc2880a331b9f05716 Signed-off-by: D. Ajay Dudani <[email protected]> introduce SIZE_MAX ULONG_MAX is often used to check for integer overflow when calculating allocation size. While ULONG_MAX happens to work on most systems, there is no guarantee that `size_t' must be the same size as `long'. This patch introduces SIZE_MAX, the maximum value of `size_t', to improve portability and readability for allocation size validation. Signed-off-by: Xi Wang <[email protected]> Acked-by: Alex Elder <[email protected]> Cc: David Airlie <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Staging: android: binder: Make task_get_unused_fd_flags function static Silence the following warning: drivers/staging/android/binder.c:368:5: warning: symbol 'task_get_unused_fd_flags' was not declared. Should it be static? Change-Id: Iacdae492c73d3b0399d2cf0d101943313082de0d Cc: Arve Hjønnevåg <[email protected]> Signed-off-by: Sachin Kamat <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> drivers: staging: android: binder.c: fix printk macros Change printk() messages to pr_* macros. Change-Id: Iaeddb5f0697bf25abc3d860cfdc431a0a7125d7f Signed-off-by: Sherwin Soltani <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> BACKPORT: Staging: android: binder: Fixed multi-line strings Changed all user visible multi-line strings to single line. Removed 'binder:' prefix on stings. Change-Id: I697fa4ee9741e2893f08062ca2256985f4977739 Signed-off-by: Anmol Sarma <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> staging: android: binder: modify struct binder_write_read to use size_t This change mirrors the userspace operation where struct binder_write_read members that specify the buffer size and consumed size are size_t elements. The patch also fixes the binder_thread_write() and binder_thread_read() functions prototypes to conform with the definition of binder_write_read. The changes do not affect existing 32bit ABI. Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: 397334fc2be6a7e2f77474bd2b24880efea007bf Git-repo: https://android.googlesource.com/kernel/common/ Change-Id: If606f0fe135ffc4a630dbf34d755f559c36ee62a Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> staging: android: binder: fix sparse warnings Fix sparse warnings by adding __user annotation to stucts. 
This patch fixes the the following sparse warnings: drivers/staging/android/binder.c:1343:76: warning: incorrect type in argument 2 (different address spaces) drivers/staging/android/binder.c:1343:76: expected void [noderef] <asn:1>*ptr drivers/staging/android/binder.c:1343:76: got void *binder drivers/staging/android/binder.c:1567:57: warning: incorrect type in argument 2 (different address spaces) drivers/staging/android/binder.c:1567:57: expected void const [noderef] <asn:1>*from drivers/staging/android/binder.c:1567:57: got void const *buffer drivers/staging/android/binder.c:1573:46: warning: incorrect type in argument 2 (different address spaces) drivers/staging/android/binder.c:1573:46: expected void const [noderef] <asn:1>*from drivers/staging/android/binder.c:1573:46: got void const *offsets drivers/staging/android/binder.c:1603:76: warning: incorrect type in argument 2 (different address spaces) drivers/staging/android/binder.c:1603:76: expected void [noderef] <asn:1>*ptr drivers/staging/android/binder.c:1603:76: got void *binder drivers/staging/android/binder.c:1605:64: warning: incorrect type in argument 2 (different address spaces) drivers/staging/android/binder.c:1605:64: expected void [noderef] <asn:1>*ptr drivers/staging/android/binder.c:1605:64: got void *binder drivers/staging/android/binder.c:1605:76: warning: incorrect type in argument 3 (different address spaces) drivers/staging/android/binder.c:1605:76: expected void [noderef] <asn:1>*cookie drivers/staging/android/binder.c:1605:76: got void *cookie drivers/staging/android/binder.c:1613:40: error: incompatible types in comparison Change-Id: I6cac879e87993f077700574420a798226a21721d Signed-off-by: Emil Goode <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> staging: android: Avoid using camelcase in binder.h This changes the following: 1: BinderDriverReturnProtocol -> binder_driver_return_protocol 2: BinderDriverCommandProtocol -> binder_driver_return_protocol These enums are not currently used, but still generate noise in checkpatch. Well, did. They don't now :) Change-Id: I7eeb7b8fc20ed1c4b3736f3f36b6637a1a631560 Signed-off-by: Cruz Julian Bishop <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> staging: android: binder: fix alignment issues The Android userspace aligns the data written to the binder buffers to 4bytes. Thus for 32bit platforms or 64bit platforms running an 32bit Android userspace we can have a buffer looking like this: platform buffer(binder_cmd pointer) size 32/32 32b 32b 8B 64/32 32b 64b 12B 64/64 32b 64b 12B Thus the kernel needs to check that the buffer size is aligned to 4bytes not to (void *) that will be 8bytes on 64bit machines. The change does not affect existing 32bit ABI. Change-Id: Idcad35da1c1567ee0676d60d03afd07b219c59ea Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: ec35e852dc9de9809f88ff397d7a611208880f9f Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> Staging: android: Mark local functions in binder.c as static This fixes the following sparse warnings drivers/staging/android/binder.c:1703:5: warning: symbol 'binder_thread_write' was not declared. Should it be static? drivers/staging/android/binder.c:2058:6: warning: symbol 'binder_stat_br' was not declared. Should it be static? 
Change-Id: I930f10e54c19b0c6aca275f3ef51320bcfa3bb34 Signed-off-by: Bojan Prtvar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: fb07ebc3e82a98a3605112b71ea819c359549c4b Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> Staging: android: add __user annotation in binder.c This fixes the following sparse error drivers/staging/android/binder.c:1795:36: error: incompatible types in comparison expression (different address spaces) Change-Id: Icf6b3868442fc3e4f145c8d13de9626abc306cd1 Signed-off-by: Bojan Prtvar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: 308fbd8ac0b0078dba29cad027e5b454aac13a6a Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> staging: android: binder: replace IOCTL types with user-exportable types This patch modifies the IOCTL macros to use user-exportable data types, as they are the referred kernel types for the user/kernel interface. The patch does not change in any way the functionality of the binder driver. Change-Id: I784358581eba5c04c9bb3235cd4ae68f0225129a Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> staging: android: binder: fix BINDER_SET_MAX_THREADS declaration This change will fix the BINDER_SET_MAX_THREADS ioctl to use __u32 instead of size_t for setting the max threads. Thus using the same handler for 32 and 64bit kernels. This value is stored internally in struct binder_proc and set to 15 on open_binder() in the libbinder API(thus no need for a 64bit size_t on 64bit platforms). The change does not affect existing 32bit ABI. Change-Id: I193678d455b6527d54c524feb785631df8faed5a Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: a9350fc859ae3f1db7f2efd55c7a3e0d09a4098d Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> staging: android: binder: fix BC_FREE_BUFFER ioctl declaration BinderDriverCommands mirror the ioctl usage. Thus the size of the structure passed through the interface should be used to generate the ioctl No. The change reflects the type being passed from the user space-a pointer to a binder_buffer. This change should not affect the existing 32bit user space since BC_FREE_BUFFER is computed as: #define _IOW(type,nr,size) \ ((type) << _IOC_TYPESHIFT) | \ ((nr) << _IOC_NRSHIFT) | \ ((size) << _IOC_SIZESHIFT)) and for a 32bit compiler BC_FREE_BUFFER will have the same computed value. This change will also ease our work in differentiating BC_FREE_BUFFER from COMPAT_BC_FREE_BUFFER. The change does not affect existing 32bit ABI. 
Change-Id: I72c6bfae325840a825c8786a79a07ffad540d602 Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: fc56f2ecf091d774c18ad0d470c62c6818fa32a3 Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> staging: android: binder: replace types with portable ones Since this driver is meant to be used on different types of processors and a portable driver should specify the size a variable expects to be this patch changes the types used throughout the binder interface. We use "userspace" types since this header will be exported and used by the Android filesystem. The patch does not change in any way the functionality of the binder driver. Change-Id: Iede6575f6f9d76bec0bbed11948abe3ff081d0ee Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: eecddef594f9eb159040160b929642f16a07387f Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> staging: android: binder: fix binder interface for 64bit compat layer The changes in this patch will fix the binder interface for use on 64bit machines and stand as the base of the 64bit compat support. The changes apply to the structures that are passed between the kernel and userspace. Most of the changes applied mirror the change to struct binder_version where there is no need for a 64bit wide protocol_version(on 64bit machines). The change inlines with the existing 32bit userspace(the structure has the same size) and simplifies the compat layer such that the same handler can service the BINDER_VERSION ioctl. Other changes make use of kernel types as well as user-exportable ones and fix format specifier issues. The changes do not affect existing 32bit ABI. Change-Id: If00cb82dc4407a5e0890abbcb4019883e99e9a1f Signed-off-by: Serban Constantinescu <[email protected]> Acked-by: Arve Hjønnevåg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: 64dcfe6b84d4104d93e4baf2b5a0b3e7f2e4cc30 Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> drivers: staging: android: split uapi out of binder.h Move the userspace interface of binder.h to drivers/staging/android/uapi/binder.h. Change-Id: I2e56ba89ade5e1f33b121e6ecd456392d588a14e Signed-off-by: Colin Cross <[email protected]> Git-commit: 06f505a4d5719da00e76bde885792a7d5ec968f8 Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> staging: android: binder: fix ABI for 64bit Android This patch fixes the ABI for 64bit Android userspace. BC_REQUEST_DEATH_NOTIFICATION and BC_CLEAR_DEATH_NOTIFICATION claim to be using struct binder_ptr_cookie, but they are using a 32bit handle and a pointer. On 32bit systems the payload size is the same as the size of struct binder_ptr_cookie, however for 64bit systems this will differ. This patch adds struct binder_handle_cookie that fixes this issue for 64bit Android. Since there are no 64bit users of this interface that we know of this change should not affect any existing systems. 
Change-Id: I8909cbc50aad48ccf371270bad6f69ff242a8c22 Signed-off-by: Serban Constantinescu <[email protected]> Git-commit: 34d977e7af9bb097530aa71204d591485f7dddc7 Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: David Ng <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> binder: prevent kptr leak by using %pK format specifier Works in conjunction with kptr_restrict. Bug: 30143283 Change-Id: I2b3ce22f4e206e74614d51453a1d59b7080ab05a staging: binder: add vm_fault handler An issue was observed when a userspace task exits. The page which hits error here is the zero page. In binder mmap, the whole of vma is not mapped. On a task crash, when debuggerd reads the binder regions, the unmapped areas fall to do_anonymous_page in handle_pte_fault, due to the absence of a vm_fault handler. This results in zero page being mapped. Later in zap_pte_range, vm_normal_page returns zero page in the case of VM_MIXEDMAP and it results in the error. BUG: Bad page map in process mediaserver pte:9dff379f pmd:9bfbd831 page:c0ed8e60 count:1 mapcount:-1 mapping: (null) index:0x0 page flags: 0x404(referenced|reserved) addr:40c3f000 vm_flags:10220051 anon_vma: (null) mapping:d9fe0764 index:fd vma->vm_ops->fault: (null) vma->vm_file->f_op->mmap: binder_mmap+0x0/0x274 CPU: 0 PID: 1463 Comm: mediaserver Tainted: G W 3.10.17+ #1 [<c001549c>] (unwind_backtrace+0x0/0x11c) from [<c001200c>] (show_stack+0x10/0x14) [<c001200c>] (show_stack+0x10/0x14) from [<c0103d78>] (print_bad_pte+0x158/0x190) [<c0103d78>] (print_bad_pte+0x158/0x190) from [<c01055f0>] (unmap_single_vma+0x2e4/0x598) [<c01055f0>] (unmap_single_vma+0x2e4/0x598) from [<c010618c>] (unmap_vmas+0x34/0x50) [<c010618c>] (unmap_vmas+0x34/0x50) from [<c010a9e4>] (exit_mmap+0xc8/0x1e8) [<c010a9e4>] (exit_mmap+0xc8/0x1e8) from [<c00520f0>] (mmput+0x54/0xd0) [<c00520f0>] (mmput+0x54/0xd0) from [<c005972c>] (do_exit+0x360/0x990) [<c005972c>] (do_exit+0x360/0x990) from [<c0059ef0>] (do_group_exit+0x84/0xc0) [<c0059ef0>] (do_group_exit+0x84/0xc0) from [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) from [<c0011500>] (do_signal+0xa8/0x3b8) Add a vm_fault handler which returns VM_FAULT_SIGBUS, and prevents the wrong fallback to do_anonymous_page. CRs-Fixed: 673147 Change-Id: I43730a51b6c819538b46c5e4dc5c96c8a384098d Signed-off-by: Vinayak Menon <[email protected]> Patch-mainline: linux-arm-kernel @ 06/02/14, 18:17 Signed-off-by: Vignesh Radhakrishnan <[email protected]> Signed-off-by: Subbaraman Narayanamurthy <[email protected]> Staging: android: binder: Support concurrent 32 bit and 64 bit processes. Add binder_size_t and binder_uintptr_t that is used instead of size_t and void __user * in the user-space interface. Use 64 bit pointers on all systems unless CONFIG_ANDROID_BINDER_IPC_32BIT is set (which enables the old protocol on 32 bit systems). Change BINDER_CURRENT_PROTOCOL_VERSION to 8 if CONFIG_ANDROID_BINDER_IPC_32BIT is not set. Add compat ioctl. 
Change-Id: Ifbbde0209da0050011bcab34c547a4c30d6e8c49 Signed-off-by: Arve Hjønnevåg <[email protected]> Git-commit: 1c4aa9fb12e8b0a54f056b8402b0bde61b49498f Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: David Ng <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> binder: Quiet Binder Temporary change to avoid watchdog bark because of excessive failed transaction logging CRs-Fixed: 572081 Change-Id: Id664d65ab9e78627991f8b7d4f4e5e126908c214 Signed-off-by: Uma Maheshwari Bhiram <[email protected]> binder: NULL pointer reference This fix handles a possible NULL pointer reference in debug message. CRs-fixed: 642883 Change-Id: Ide78f281ec0cff5cbd8231b85c305d13a892854e Signed-off-by: Paresh Nakhe <[email protected]> Staging: android: binder: More offset validation. Make sure offsets don't point to overlapping flat_binder_object structs. Change-Id: I425ab0c46fbe2b00ed679c5becf9e8140395eb40 Signed-off-by: Arve Hjønnevåg <[email protected]> Git-commit: 457c3cd05958b8397211ae1f6dd3c3d325f4c0ea Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> drivers: android: binder: Fix code style in binder_deferred_release * Use tabs where applicable * Remove a few "80-columns" checkpatch warnings * Separate code paths with empty lines for readability Change-Id: I634852d0812756e2c0412152a36c99dd9a9bb94a Signed-off-by: Mirsal Ennaime <[email protected]> Reviewed-by: Dan Carpenter <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> drivers: android: binder: Remove excessive indentation Remove one level of indentation from the binder proc page release code by using slightly different control semantics. Change-Id: I7a34049bf32799d7954da770f05411183c950778 Signed-off-by: Mirsal Ennaime <[email protected]> Reviewed-by: Dan Carpenter <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> drivers: android: binder: Use __func__ in debug messages Debug messages sent in binder_deferred_release begin with "binder_release:" which is a bit misleading as binder_release is not directly part of the call stack. Use __func__ instead for debug messages in binder_deferred_release. Change-Id: I7b9e2efaed188328d5b0dc82fbfe314a3666237c Signed-off-by: Mirsal Ennaime <[email protected]> Reviewed-by: Dan Carpenter <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> staging: android: Fix typo in staging/android Fix "with with" in debug message. Change-Id: Ibb60ca741d8ec760873054db53ad83e1b8a70c15 Signed-off-by: Masanari Iida <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Git-commit: 1dcdbfd6d9a5172ece7ccccbca90531d4cf62083 Git-repo: https://android.googlesource.com/kernel/common/ Signed-off-by: Neeti Desai <[email protected]> Signed-off-by: Ajay Dudani <[email protected]> binder: blacklist %p kptr_restrict Bug: 31495231 Change-Id: Iebc150f6bc939b56e021424ee44fb30ce8d732fd android: binder: split flat_binder_object. flat_binder_object is used for both handling binder objects and file descriptors, even though the two are mostly independent. Since we'll have more fixup objects in binder in the future, instead of extending flat_binder_object again, split out file descriptors to their own object while retaining backwards compatibility to existing user-space clients. All binder objects just share a header. 
Change-Id: Ifffa8cb749335d0ee79226c98f70786190516355 Signed-off-by: Martijn Coenen <[email protected]> BACKPORT: android: binder: support multiple context managers. Move the context manager state into a separate struct context, and allow for each process to have its own context associated with it. Change-Id: I6a9dfacb7b73a29760e367ff0b4e0ee21f2d0380 Signed-off-by: Martijn Coenen <[email protected]> Signed-off-by: Adrian DC <[email protected]> BACKPORT: android: binder: deal with contexts in debugfs. Properly print the context in debugfs entries. Change-Id: Ieeb89bfa8e760635366ce8b60569fbbd4937b844 Signed-off-by: Martijn Coenen <[email protected]> BACKPORT: android: binder: support multiple /dev instances. Add a new module parameter 'devices', that can be used to specify the names of the binder device nodes we want to populate in /dev. Each device node has its own context manager, and is therefore logically separated from all the other device nodes. The config option CONFIG_ANDROID_BINDER_DEVICES can be used to set the default value of the parameter. This approach was favored over using IPC namespaces, mostly because we require a single process to be a part of multiple binder contexts, which seemed harder to achieve with namespaces. [AdrianDC] Backport to 3.4 with list.h iterator hlist_for_each_entry_rcu node kept Change-Id: I3d8531c44e82ef7db4d8b9fa0c1761d4ec282e3d Signed-off-by: Martijn Coenen <[email protected]> Signed-off-by: Adrian DC <[email protected]> android: binder: refactor binder_transact() Moved handling of fixup for binder objects, handles and file descriptors into separate functions. Change-Id: If6849f1caee3834aa87d0ab08950bb1e21ec6e38 Signed-off-by: Martijn Coenen <[email protected]> android: binder: add extra size to allocator. The binder_buffer allocator currently only allocates space for the data and offsets buffers of a Parcel. This change allows for requesting an additional chunk of data in the buffer, which can for example be used to hold additional meta-data about the transaction (eg a security context). Change-Id: I58ab9c383a2e1a3057aae6adaa596ce867f1b157 Signed-off-by: Martijn Coenen <[email protected]> android: binder: support for scatter-gather. Previously all data passed over binder needed to be serialized, with the exception of Binder objects and file descriptors. This patchs adds support for scatter-gathering raw memory buffers into a binder transaction, avoiding the need to first serialize them into a Parcel. To remain backwards compatibile with existing binder clients, it introduces two new command ioctls for this purpose - BC_TRANSACTION_SG and BC_REPLY_SG. These commands may only be used with the new binder_transaction_data_sg structure, which adds a field for the total size of the buffers we are scatter-gathering. Because memory buffers may contain pointers to other buffers, we allow callers to specify a parent buffer and an offset into it, to indicate this is a location pointing to the buffer that we are fixing up. The kernel will then take care of fixing up the pointer to that buffer as well. Change-Id: Ibe17f4f5629d1d541a03f1e826cfd153b64f0d8c Signed-off-by: Martijn Coenen <[email protected]> android: binder: support for file-descriptor arrays. This patch introduces a new binder_fd_array object, that allows us to support one or more file descriptors embedded in a buffer that is scatter-gathered. 
Change-Id: I647a53cf0d905c7be0dfd9333806982def68dd74 Signed-off-by: Martijn Coenen <[email protected]> binder: don't allow mmap() by process other than proc->tsk we really shouldn't do get_files_struct() on a different process and use it to modify the sucker later on. Change-Id: I2be2b99395b6efa85a007317b25e6e9e7953c47a Signed-off-by: Al Viro <[email protected]> binder: use group leader instead of open thread The binder allocator assumes that the thread that called binder_open will never die for the lifetime of that proc. That thread is normally the group_leader, however it may not be. Use the group_leader instead of current. Bug: 35707103 Test: Created test case to open with temporary thread Change-Id: Id693f74b3591f3524a8c6e9508e70f3e5a80c588 Signed-off-by: Todd Kjos <[email protected]> ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES. These will be required going forward. Change-Id: I8f24e1e9f87a6773bd84fb9f173a3725c376c692 Signed-off-by: Martijn Coenen <[email protected]> android: binder: add padding to binder_fd_array_object. binder_fd_array_object starts with a 4-byte header, followed by a few fields that are 8 bytes when ANDROID_BINDER_IPC_32BIT=N. This can cause alignment issues in a 64-bit kernel with a 32-bit userspace, as on x86_32 an 8-byte primitive may be aligned to a 4-byte address. Pad with a __u32 to fix this. Change-Id: I4374ed2cc3ccd3c6a1474cb7209b53ebfd91077b Signed-off-by: Martijn Coenen <[email protected]> BACKPORT: ARM: 8091/2: add get_user() support for 8 byte types Recent contributions, including to DRM and binder, introduce 64-bit values in their interfaces. A common motivation for this is to allow the same ABI for 32- and 64-bit userspaces (and therefore also a shared ABI for 32/64 hybrid userspaces). Anyhow, the developers would like to avoid gotchas like having to use copy_from_user(). This feature is already implemented on x86-32 and the majority of other 32-bit architectures. The current list of get_user_8 hold out architectures are: arm, avr32, blackfin, m32r, metag, microblaze, mn10300, sh. Credit: My name sits rather uneasily at the top of this patch. The v1 and v2 versions of the patch were written by Rob Clark and to produce v4 I mostly copied code from Russell King and H. Peter Anvin. However I have mangled the patch sufficiently that *blame* is rightfully mine even if credit should more widely shared. Changelog: v5: updated to use the ret macro (requested by Russell King) v4: remove an inlined add on big endian systems (spotted by Russell King), used __ARMEB__ rather than BIG_ENDIAN (to match rest of file), cleared r3 on EFAULT during __get_user_8. v3: fix a couple of checkpatch issues v2: pass correct size to check_uaccess, and better handling of narrowing double word read with __get_user_xb() (Russell King's suggestion) v1: original Change-Id: I41787d73f0844c15b6bd0424a5f83cafaba8b508 Signed-off-by: Rob Clark <[email protected]> Signed-off-by: Daniel Thompson <[email protected]> Signed-off-by: Russell King <[email protected]> [flex1911: backport to 3.4: use "mov pc" instruction instead of nonexistent here "ret" macro] Signed-off-by: Artem Borisov <[email protected]> Signed-off-by: D. Andrei Măceș <[email protected]> staging: binder: Improve Kconfig entry for ANDROID_BINDER_IPC_32BIT Add a more clear explanation of the option in the prompt, and make the config depend on ANDROID_BINDER_IPC being selected. Also sets the default to y, which matches AOSP. 
Change-Id: Ib85275f415a3a10b3b40e0f847cb43fce69c203b Cc: Colin Cross <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: Serban Constantinescu <[email protected]> Cc: Android Kernel Team <[email protected]> Signed-off-by: John Stultz <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: D. Andrei Măceș <[email protected]> UPSTREAM: drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE There's one point was missed in the patch commit da49889deb34 ("staging: binder: Support concurrent 32 bit and 64 bit processes."). When configure BINDER_IPC_32BIT, the size of binder_uintptr_t was 32bits, but size of void * is 64bit on 64bit system. Correct it here. Signed-off-by: Lisa Du <[email protected]> Signed-off-by: Nicolas Boichat <[email protected]> Fixes: da49889deb34 ("staging: binder: Support concurrent 32 bit and 64 bit processes.") Cc: <[email protected]> Acked-by: Olof Johansson <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> (cherry picked from commit 7a64cd887fdb97f074c3fda03bee0bfb9faceac3) [drinkcat: upstream moved drivers/staging/android to drivers/android] BUG=b:26833439 TEST=See b:26833439 comment #22 Signed-off-by: Nicolas Boichat <[email protected]> Change-Id: I4a3fd73fea1cbd5c8ea0a9f33c17ef25006ad0dd Signed-off-by: D. Andrei Măceș <[email protected]> ANDROID: binder: fix compilation warnings. When using the 64-bit binder interface on 32-bit kernels. These are safe to cast, because they are buffers we have previously successfully called copy_from_user() on. Change-Id: Ifa4ef2f99b28202d3d1bd1417e441266a2609cec Signed-off-by: Martijn Coenen <[email protected]> Signed-off-by: D. Andrei Măceș <[email protected]>
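Several of the binder commits above revolve around overflow-safe size arithmetic; the SIZE_MAX patch in particular replaces ULONG_MAX in allocation-size checks because size_t need not be as wide as long. A minimal userspace sketch of that style of check follows, with hypothetical names (this is an illustration of the pattern, not code from the driver):

```c
#include <stdint.h>   /* SIZE_MAX */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate an array of `count` elements of `elem_size`
 * bytes, refusing requests whose total size would overflow size_t. Checking
 * against SIZE_MAX instead of ULONG_MAX keeps the test correct even on
 * platforms where size_t and unsigned long differ in width. */
static void *alloc_array(size_t count, size_t elem_size)
{
	if (elem_size && count > SIZE_MAX / elem_size)
		return NULL;              /* count * elem_size would overflow */
	return malloc(count * elem_size);
}

int main(void)
{
	void *ok  = alloc_array(16, sizeof(int));  /* 64 bytes, fine */
	void *bad = alloc_array(SIZE_MAX, 2);      /* rejected before malloc */

	printf("ok=%p bad=%p\n", ok, bad);
	free(ok);
	return 0;
}
```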
alesaiko pushed a commit that referenced this issue on Mar 2, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger nohz idle balancer on that cpu for unnecessarily long time (introducing additional scheduling latencies for tasks). Fix bug in scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a a per-cpu hrtimer resource and hence alarming that hrtimer should be based on total SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, its possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rq on same cpu. Fix this by alarming sched hrtimer based on total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged. 
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going un-noticed because there was an earlier instance that occured long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems. 
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarentee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. 
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. one jiffy and then go stale. forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's priority, while rt_rq passed to them may be not the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no one other task has the same priority) are waking on a throttle rt_rq. The rq's cpupri is set to the task's priority equivalent, but real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in earlty use. 
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds to the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be increased twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of the last timeout update. As long as the RT task's jiffies value differs from the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevent futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied.
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to the head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock, code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition: it relies on the try_to_wake_up/schedule interaction and thus does not need an mb() between __set_current_state() and the if (signal_pending) check. However, this __set_current_state() can move into the critical section protected by rq->lock; now that try_to_wake_up() takes another lock, we need to ensure that it can't be reordered with the "if (signal_pending(current))" check inside that section. The patch is actually a one-liner: it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does for the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers.
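To make the ordering requirement concrete, here is a small userspace analogy (plain C11 atomics and pthreads, not the kernel code; all names are invented for the illustration): a sleeper publishes its "about to sleep" state and then checks for a pending wakeup, while a waker publishes the wakeup and then checks whether the sleeper is asleep. The fences stand in for the smp_wmb()/smp_mb__before_spinlock() the patch adds and for the barrier try_to_wake_up() already has on its side.

/* Store-buffering analogy of the signal_wake_up() vs schedule() race. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int sleeper_state;   /* 1 == "about to sleep" (TASK_INTERRUPTIBLE analogue) */
static atomic_int wake_pending;    /* 1 == "signal/wakeup posted" */
static int waker_saw_sleeper, sleeper_saw_wakeup;

static void *sleeper(void *arg)
{
    (void)arg;
    atomic_store_explicit(&sleeper_state, 1, memory_order_relaxed);
    /* Plays the role of the barrier added before taking rq->lock. */
    atomic_thread_fence(memory_order_seq_cst);
    sleeper_saw_wakeup = atomic_load_explicit(&wake_pending, memory_order_relaxed);
    return NULL;
}

static void *waker(void *arg)
{
    (void)arg;
    atomic_store_explicit(&wake_pending, 1, memory_order_relaxed);
    /* try_to_wake_up() already has the equivalent barrier on its side. */
    atomic_thread_fence(memory_order_seq_cst);
    waker_saw_sleeper = atomic_load_explicit(&sleeper_state, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, sleeper, NULL);
    pthread_create(&b, NULL, waker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With both fences in place, "nobody saw anybody" cannot happen.
     * Remove either fence and this assertion may fail, i.e. the wakeup
     * can be lost -- the window the one-line kernel patch closes. */
    assert(waker_saw_sleeper || sleeper_saw_wakeup);
    printf("waker_saw_sleeper=%d sleeper_saw_wakeup=%d\n",
           waker_saw_sleeper, sleeper_saw_wakeup);
    return 0;
}

This is only a sketch of the ordering argument; the kernel obviously expresses it with its own barrier primitives rather than C11 fences.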
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks in idle_balance() if a task should be pulled into the current runqueue, assuming it will otherwise go idle. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rakib and Paul reported two different issues related to the same few lines of code. Rakib's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rakib please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug; we would all like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and it's using possible, let's not bother at all. The problem Rakib sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization; we know the state of the rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E.
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_slice() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. It's not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on.
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny things, I found that the rq->cpu_load[] tables were completely screwy... a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc.); this means that especially the higher indices were significantly lower than they ought to be. The reason for this is that it's not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch kludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because tasks (with all the cpus allowed) belonging to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues.
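The failure mode is easy to picture with a tiny userspace sketch (hypothetical stand-ins, not the kernel cpumask API): a bitmap handed out without zeroing can already have bits set, and any consumer that treats set bits as "CPUs eligible for RT pulls" will chase phantom CPUs, while a zero-initialized bitmap starts out empty as the callers assume.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 8

static void dump_set_bits(const char *name, const unsigned long *mask)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (*mask & (1UL << cpu))
            printf("%s: cpu %d looks eligible for an RT pull\n", name, cpu);
}

int main(void)
{
    /* alloc_cpumask_var() analogue: whatever happened to be in the heap. */
    unsigned long *garbage = malloc(sizeof(*garbage));
    /* zalloc_cpumask_var() analogue: the mask starts out empty. */
    unsigned long *zeroed = calloc(1, sizeof(*zeroed));

    if (!garbage || !zeroed)
        return 1;
    memset(garbage, 0xAA, sizeof(*garbage));   /* simulate stale heap contents */

    dump_set_bits("garbage", garbage);   /* phantom CPUs before any bit was ever set */
    dump_set_bits("zeroed", zeroed);     /* prints nothing, as the callers expect */

    free(garbage);
    free(zeroed);
    return 0;
}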
Do the same thing for root_domain's other cpumask members: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback-to-ksoftirqd condition to: - a time limit of 2 ms, - need_resched() being set on the current task. When one of these conditions is met, we wake up ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In the following bench, 200 antagonist "netperf -t TCP_RR" instances are started in the background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on a jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_softirq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date.
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
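Taken together, the two softirq changes above amount to a dispatch loop with three independent bail-out conditions: a time budget, a reschedule request, and a restart cap that, unlike the jiffies timeout, cannot stall. A minimal userspace sketch of that shape follows; the pending counters and helper names are invented for the illustration and this is not the kernel's __do_softirq().

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MAX_SOFTIRQ_RESTART 10                 /* the re-added restart cap */
#define BUDGET_NSEC (2 * 1000 * 1000)          /* the ~2 ms time limit */

static int pending = 25;        /* simulated softirq work raised so far */
static int refills = 3;         /* simulated work raised while processing */

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static bool need_resched(void) { return false; }     /* stand-in for the real check */
static void handle_one_pending(void) { pending--; }  /* stand-in for softirq actions */

static bool more_work_arrived(void)
{
    if (refills-- > 0) {
        pending = 5;            /* pretend new softirqs were raised meanwhile */
        return true;
    }
    return false;
}

static void wakeup_ksoftirqd(void)
{
    printf("deferring %d pending items to the ksoftirqd thread\n", pending);
}

static void do_softirq_sketch(void)
{
    long long deadline = now_ns() + BUDGET_NSEC;
    int max_restart = MAX_SOFTIRQ_RESTART;

restart:
    while (pending > 0)                  /* handle what was pending for this pass */
        handle_one_pending();

    if (more_work_arrived()) {
        /* Restart inline only while we are under the time budget, nobody asked
         * us to reschedule, and the restart cap has not been exhausted. */
        if (now_ns() < deadline && !need_resched() && --max_restart)
            goto restart;
        wakeup_ksoftirqd();
        return;
    }
    printf("all pending work handled inline\n");
}

int main(void)
{
    do_softirq_sketch();
    return 0;
}

The ordering of the three checks is arbitrary here; the point is only that any one of them is enough to hand the remaining work to the helper thread instead of spinning in the dispatcher.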
alesaiko pushed a commit that referenced this issue Mar 11, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. However, the feature has too much of a negative impact on other workloads to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in the jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However, when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so means the nohz idle balancer may not be triggered on that cpu for an unnecessarily long time (introducing additional scheduling latencies for tasks). Fix the bug in the scheduler which aims to reset next_balance before a cpu goes idle (as per the existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks The SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a per-cpu hrtimer resource, and hence arming that hrtimer should be based on the total number of SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on the number of tasks in a particular cfs_rq (as implemented currently). As a result, with the current code, it's possible for a running task (which is the sole task in its cfs_rq) to be preempted long after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rqs on the same cpu. Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in the sched_slice() calculation by using the wrong cfs_rq for tasks. rq->cfs was incorrectly used as the task's cfs_rq, rather than the correct one to which they belonged.
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
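Stepping back for a moment to the re-read hazard fixed by "sched: Make sure to not re-read variables after validation" above: the pattern is worth spelling out, since it recurs wherever a concurrently-updated value is first range-checked and then used. Below is a small sketch with invented names (plain C11 atomics, not the scheduler code); the point is that snapshotting the value once keeps the check and the subtraction consistent.

#include <stdatomic.h>
#include <stdio.h>

static atomic_long rt_avg;          /* imagined value updated concurrently elsewhere */

static long available_buggy(long total)
{
    if (atomic_load(&rt_avg) >= total)
        return 0;
    /* Second read: the value may have grown past 'total' in the meantime,
     * so this difference can be negative despite the check above. */
    return total - atomic_load(&rt_avg);
}

static long available_fixed(long total)
{
    long avg = atomic_load(&rt_avg);   /* read once, validate and use the snapshot */

    if (avg >= total)
        return 0;
    return total - avg;
}

int main(void)
{
    atomic_store(&rt_avg, 300);
    printf("buggy: %ld, fixed: %ld\n", available_buggy(1024), available_fixed(1024));
    return 0;
}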
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometimes not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 is selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part of the kernel to come back and clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK into got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger a BUG while holding the rq lock, crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit, it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going unnoticed because there was an earlier instance that occurred long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems.
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for the need to notify the cpufreq governor on task migrations or wakeups. However, the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to an invalid memory reference. Fix this by running the test for notification with the task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export the rq length). Peter came up with the elegant idea of returning -ESRCH in the scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added. (thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain.
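A compact sketch of the selection policy just described: prefer the cache-affine previous CPU when it is idle, and only otherwise fall back to scanning the LLC domain. The helper names and the array-based "domain" are invented for the illustration; the tbench figures that follow are the author's measurements of the real change.

#include <stdbool.h>
#include <stdio.h>

/* Pick a CPU for a waking task: previous CPU first if it is idle, otherwise
 * the first idle CPU in its last-level-cache domain, otherwise stay put. */
static int pick_wake_cpu(int prev_cpu, const bool *cpu_idle,
                         const int *llc_cpus, int nr_llc)
{
    if (cpu_idle[prev_cpu])
        return prev_cpu;                    /* cache hot and idle: no bouncing */

    for (int i = 0; i < nr_llc; i++)        /* old behaviour: first idle in LLC */
        if (cpu_idle[llc_cpus[i]])
            return llc_cpus[i];

    return prev_cpu;                        /* nothing idle: keep cache affinity */
}

int main(void)
{
    bool cpu_idle[4] = { false, true, true, false };
    int llc_cpus[4] = { 0, 1, 2, 3 };

    /* The task last ran on CPU 2, which is idle: it is chosen directly instead
     * of bouncing to CPU 1, the first idle CPU found by the domain walk. */
    printf("selected cpu %d\n", pick_wake_cpu(2, cpu_idle, llc_cpus, 4));
    return 0;
}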
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0, which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify that making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is a 64 bit type, it is possible to trigger a divide by zero on the do_div(temp, (__force u32) total) line, if total is a non-zero number but has its lower 32 bits zeroed. Removing the casting is not a good solution since some do_div() implementations do cast to u32 internally.
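The arithmetic behind this is easy to demonstrate in isolation. A minimal sketch of the failure mode (plain C with made-up values, not the kernel fix): a 64-bit total can be non-zero while its low 32 bits are all zero, so the u32 truncation used as the divisor becomes zero, which is what the crash dump below runs into in thread_group_times().

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t total = UINT64_C(0x200000000);   /* non-zero, but low 32 bits are zero */
    uint64_t temp  = UINT64_C(0x12345678);

    uint32_t divisor = (uint32_t)total;       /* what the (__force u32) cast does */
    printf("total=0x%" PRIx64 ", truncated divisor=%" PRIu32 "\n", total, divisor);

    if (divisor == 0) {
        /* Without a guard like this, temp / divisor faults the same way the
         * do_div(temp, (u32)total) call does in the crash dump below. */
        printf("truncated divisor is zero: the division would trap\n");
        return 0;
    }
    printf("temp/divisor = %" PRIu64 "\n", temp / divisor);
    return 0;
}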
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
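To see where the "~4 seconds ahead" figure comes from, here is a small arithmetic illustration of the 32-bit tearing (plain C with invented values, not the kernel code): if the remote update lands between the low and high 32-bit reads while the clock crosses a 2^32 ns boundary, the reader stitches the old low half to the new high half and observes a timestamp roughly 2^32 ns, about 4.29 seconds, in the future.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t before = UINT64_C(0x00000001FFFFFF00);  /* clock just below a 2^32 ns boundary */
    uint64_t after  = before + 0x200;                /* small forward update crosses the boundary */

    uint32_t low_old  = (uint32_t)before;            /* read_low32bit() happens first ... */
    uint32_t high_new = (uint32_t)(after >> 32);     /* ... update hits, then read_high32bit() */

    uint64_t torn = ((uint64_t)high_new << 32) | low_old;

    printf("real clock  : %" PRIu64 " ns\n", after);
    printf("torn readout: %" PRIu64 " ns (%.2f s ahead)\n",
           torn, (double)(torn - after) / 1e9);
    /* Verifying the remote readout on 32-bit systems, as the changelog above
     * describes, prevents a torn value like this from ever being propagated
     * into the other CPU's sched_clock. */
    return 0;
}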
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. one jiffy and then go stale. forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's priority, while rt_rq passed to them may be not the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no one other task has the same priority) are waking on a throttle rt_rq. The rq's cpupri is set to the task's priority equivalent, but real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in earlty use. 
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson<[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. 
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
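To make the interaction above a little more concrete, here is a rough sketch of the two sides involved. This is illustrative pseudo-kernel code assembled from the description ("condition" is a placeholder), not a quote of any particular kernel source file:

	/* Waiter side: the classic wait-for-condition idiom from the text. */
	__set_current_state(TASK_INTERRUPTIBLE);
	if (!condition)
		schedule();
	__set_current_state(TASK_RUNNING);

	/* Inside schedule(), before the rq->lock section that checks for
	   pending signals, the one-liner fix boils down to: */
	smp_mb__before_spinlock();	/* the new helper wrapping the smp_wmb() */
	raw_spin_lock_irq(&rq->lock);
	if (signal_pending_state(prev->state, prev))
		prev->state = TASK_RUNNING;	/* a signal arrived: don't sleep */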
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_slice() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. It's not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on.
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks (with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. 
Do the same thing for root_domain's other cpumask members: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to: - A time limit of 2 ms, - need_resched() being set on current task; When one of these conditions is met, we wake up ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In the following bench, 200 antagonist "netperf -t TCP_RR" instances are started in background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_softirq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date.
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
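Taken together, the two softirq patches above leave the tail of __do_softirq() with roughly the following shape. This is a condensed sketch assembled from the descriptions (the 2 ms budget, the need_resched() test, and the restored max_restart cap), not the verbatim kernel source:

#define MAX_SOFTIRQ_TIME	msecs_to_jiffies(2)
#define MAX_SOFTIRQ_RESTART	10

asmlinkage void __do_softirq(void)
{
	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
	int max_restart = MAX_SOFTIRQ_RESTART;
	__u32 pending;

restart:
	/* ... run the handlers for the currently pending softirqs ... */

	pending = local_softirq_pending();
	if (pending) {
		/* Keep iterating only while we are inside the 2 ms budget,
		   nobody needs the CPU, and restarts remain; the restart cap
		   is what breaks the stop_machine lockup once jiffies stops
		   advancing.  Otherwise defer to ksoftirqd. */
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;

		wakeup_softirqd();
	}
}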
alesaiko
pushed a commit
that referenced
this issue
Mar 12, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. However, the feature has too much of a negative impact on other workloads to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger the nohz idle balancer on that cpu for an unnecessarily long time (introducing additional scheduling latencies for tasks). Fix the bug in the scheduler which aims to reset next_balance before a cpu goes idle (as per the existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks The SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a per-cpu hrtimer resource, and hence the arming of that hrtimer should be based on the total number of SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on the number of tasks in a particular cfs_rq (as implemented currently). As a result, with the current code, it's possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rqs on the same cpu. Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in the sched_slice() calculation by using the wrong cfs_rq for tasks. rq->cfs was incorrectly used as the task's cfs_rq, rather than the correct one to which they belonged.
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometimes not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 is selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2, thanks to another part of the kernel, to come back and clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding the rq lock and crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going unnoticed because there was an earlier instance that occurred long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems.
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify that making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is a 64 bit type, it is possible to trigger a divide by zero on the do_div(temp, (__force u32) total) line, if total is a non-zero number but has its lower 32 bits zeroed. Removing the casting is not a good solution since some do_div() implementations do cast to u32 internally.
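The divide-by-zero is purely an effect of 32-bit truncation, which a few lines of ordinary C can demonstrate (a standalone illustration, not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t total = 0x100000000ULL;	/* non-zero, low 32 bits all zero */
	uint32_t divisor = (uint32_t)total;	/* truncates to 0 */

	printf("total=%llu divisor=%u\n",
	       (unsigned long long)total, divisor);
	/* A do_div(temp, divisor) equivalent at this point would divide by
	   zero, which is exactly the fault in the backtrace below. */
	return 0;
}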
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
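The "~4 seconds" figure follows from the arithmetic: sched_clock counts nanoseconds, and 2^32 ns is roughly 4.29 s, so a readout that combines the stale low 32 bits with an updated high word lands about that far in the future. A standalone C illustration of such a torn read (deliberately sequential, just to show the numbers):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t clock = 0x00000001ffffffffULL;	/* just below a carry into bit 32 */
	uint32_t lo = (uint32_t)clock;		/* reader grabs the low half ... */

	clock += 1;				/* ... writer advances across the boundary ... */

	uint32_t hi = (uint32_t)(clock >> 32);	/* ... reader grabs the new high half */
	uint64_t torn = ((uint64_t)hi << 32) | lo;

	printf("real=%#llx torn=%#llx (+%llu ns)\n",
	       (unsigned long long)clock, (unsigned long long)torn,
	       (unsigned long long)(torn - clock));
	return 0;
}

The cmpxchg64-based update mentioned in the text keeps the writer atomic; the patch adds the matching care on the 32-bit reader side.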
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds. There are only two other possibilities which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing the priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all of the rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use.
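This commit appears earlier in the thread with its full walk-through of the throttled-hierarchy enqueue. Going by that description, the fix amounts to giving a newly created group entity a sane non-zero weight at initialization time; a sketch of the idea, assuming the 3.4-era init_tg_cfs_entry() and NICE_0_LOAD as the default weight (placement and details are illustrative):

	/* In init_tg_cfs_entry(), once the group's sched_entity exists: */
	if (se) {
		/* guarantee group entities always have weight */
		update_load_set(&se->load, NICE_0_LOAD);
		se->parent = parent;
	}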
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson<[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. 
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_slice() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. It's not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on.
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks (with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. 
Do the same thing for root_domain's other cpumask members: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to: - A time limit of 2 ms, - need_resched() being set on current task; When one of these conditions is met, we wake up ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In the following bench, 200 antagonist "netperf -t TCP_RR" are started in background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_irq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date.
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
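Taken together, the two softirq patches above converge on a dispatch loop bounded both by a time budget and by a hard restart cap. A small userspace sketch of that shape (illustrative only: process_pending() and the helper names are stand-ins invented here, not kernel code) shows why the cap is needed when the clock being consulted can stop advancing:

```c
/*
 * Userspace sketch of the bail-out shape the two softirq patches above end
 * up with: a 2ms time budget plus a hard cap of 10 restarts (the kernel
 * additionally bails out when need_resched() is set; that part is omitted
 * here).  process_pending() and the constants are stand-ins invented for
 * this illustration, not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MAX_SOFTIRQ_TIME_NS  (2 * 1000 * 1000)	/* 2 ms budget */
#define MAX_SOFTIRQ_RESTART  10			/* cap that works even if the clock stalls */

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Stand-in for one pass over the pending work; pretends new work appears half the time. */
static bool process_pending(void)
{
	static unsigned int seed = 12345;

	seed = seed * 1103515245u + 12345u;
	return (seed >> 16) & 1;
}

static void bounded_dispatch(void)
{
	long long deadline = now_ns() + MAX_SOFTIRQ_TIME_NS;
	int max_restart = MAX_SOFTIRQ_RESTART;
	bool more;

	/*
	 * A deadline alone is not enough: if the clock the loop consults stops
	 * advancing (as jiffies does while the CPU that updates it spins with
	 * interrupts disabled), the time check never fires.  The restart
	 * counter bounds the loop regardless.
	 */
	do {
		more = process_pending();
	} while (more && now_ns() < deadline && --max_restart);

	if (more)
		printf("deferring remaining work (the kernel wakes ksoftirqd here)\n");
}

int main(void)
{
	bounded_dispatch();
	return 0;
}
```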
alesaiko pushed a commit that referenced this issue Mar 13, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger the nohz idle balancer on that cpu for an unnecessarily long time (introducing additional scheduling latencies for tasks). Fix the bug in the scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a per-cpu hrtimer resource, and hence the arming of that hrtimer should be based on the total number of SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than on the number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, it's possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rq on same cpu. Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged.
Fix the bug by using the correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it to go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when it attaches to a new sched domain hierarchy. This should let cpus run load balance checks at the right time, as we expect them to. Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks.
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going un-noticed because there was an earlier instance that occured long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems. 
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarentee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. 
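The arithmetic behind that crash is easy to demonstrate in plain C; the snippet below is a userspace illustration of the truncation, not the kernel's do_div() itself:

```c
/*
 * Userspace illustration of the truncation described above, not the kernel's
 * do_div().  do_div() divides by a 32-bit value, so a non-zero 64-bit total
 * whose low 32 bits happen to be zero becomes a divisor of 0 once truncated.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t total = 0x100000000ULL;	/* non-zero, but the low 32 bits are 0 */
	uint32_t divisor = (uint32_t)total;	/* what a (u32) cast hands to the divide */

	printf("total   = %llu (non-zero)\n", (unsigned long long)total);
	printf("divisor = %u\n", divisor);	/* prints 0 */
	if (divisor == 0)
		printf("dividing by this traps, as in the crash dump below\n");
	return 0;
}
```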
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
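The torn read at the heart of this can be modelled in a few lines of ordinary C. The snippet below sequentializes the race (reader and writer shown as straight-line code) purely for illustration; it is not the kernel's sched_clock code:

```c
/*
 * Sequentialized userspace illustration of the torn 64-bit read, not the
 * kernel's sched_clock code.  On 32-bit, reading a 64-bit clock is two
 * separate loads; if the writer crosses a 32-bit boundary in between, the
 * reader combines an old low word with a new high word.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t ns_clock = 0xFFFFFFFFULL;		/* just below a 32-bit boundary */
	uint32_t lo, hi;
	uint64_t torn;

	lo = (uint32_t)(ns_clock & 0xFFFFFFFFu);	/* reader: low word first */
	ns_clock += 1;					/* writer: update crosses the boundary */
	hi = (uint32_t)(ns_clock >> 32);		/* reader: high word second */

	torn = ((uint64_t)hi << 32) | lo;
	/* The torn value is ~2^32 ns (about 4.3s) ahead of the real clock,
	 * matching the ~4 second jump described above. */
	printf("real clock: %llu ns\n", (unsigned long long)ns_clock);
	printf("torn read : %llu ns\n", (unsigned long long)torn);
	return 0;
}
```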
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because the clock would then continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use.
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson<[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. 
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_slice() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. It's not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on. 
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks (with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. 
Do the same thing for root_domain's other cpumask memembers: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to: - A time limit of 2 ms, - need_resched() being set on current task; When one of this condition is met, we wakeup ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In following bench, 200 antagonist "netperf -t TCP_RR" are started in background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_irq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date. 
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
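The two softirq patches above change when __do_softirq() hands work off to ksoftirqd. The following is a condensed sketch of the combined control flow only (hedged: the real 3.4 backport also does per-vector dispatch, accounting and irq-enable/disable work that is omitted here):

	#define MAX_SOFTIRQ_TIME	msecs_to_jiffies(2)
	#define MAX_SOFTIRQ_RESTART	10

	asmlinkage void __do_softirq(void)
	{
		unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
		int max_restart = MAX_SOFTIRQ_RESTART;
		__u32 pending;

		pending = local_softirq_pending();

	restart:
		set_softirq_pending(0);
		/* ... run the handler for every bit set in 'pending' ... */

		pending = local_softirq_pending();
		if (pending) {
			/* keep looping only while it is still cheap to run here */
			if (time_before(jiffies, end) && !need_resched() &&
			    --max_restart)
				goto restart;

			/* otherwise defer the remaining softirqs to ksoftirqd */
			wakeup_softirqd();
		}
	}

The 2 ms budget keeps BH-disabled sections short, while the restart counter still bounds the loop when jiffies stops advancing, which is exactly the stop_machine lockup case described above.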
alesaiko pushed a commit that referenced this issue Mar 13, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger nohz idle balancer on that cpu for unnecessarily long time (introducing additional scheduling latencies for tasks). Fix bug in scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a a per-cpu hrtimer resource and hence alarming that hrtimer should be based on total SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, its possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rq on same cpu. Fix this by alarming sched hrtimer based on total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged. 
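As a rough illustration of that mix-up (a sketch only, with a made-up wrapper name, not the actual diff): sched_slice() must be handed the cfs_rq the entity is actually queued on, which under group scheduling is generally not the root rq->cfs.

	static u64 task_timeslice_sketch(struct rq *rq, struct task_struct *p)
	{
		struct sched_entity *se = &p->se;

		/* buggy form: always uses the runqueue's root cfs_rq */
		/* return sched_slice(&rq->cfs, se); */

		/* fixed form: use the cfs_rq this entity actually belongs to */
		return sched_slice(cfs_rq_of(se), se);
	}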
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going un-noticed because there was an earlier instance that occured long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems. 
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
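A rough sketch of the preference this patch adds (not the exact upstream diff; the fallback scan over sd_llc is elided) — the tbench numbers quoted next show its effect:

	static int select_idle_sibling_sketch(struct task_struct *p, int target)
	{
		int prev = task_cpu(p);

		if (idle_cpu(target))
			return target;

		/* prefer the previous CPU if it shares cache with target and is idle */
		if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
			return prev;

		/* ... otherwise walk sd_llc looking for an idle CPU, as before ... */
		return target;
	}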
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is a 64 bit type, it is possible to trigger a divide by zero on the do_div(temp, (__force u32) total) line, if total is a non-zero number but has its lower 32 bits zeroed. Removing the casting is not a good solution since some do_div() implementations do cast to u32 internally. 
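To make the hazard concrete, here is a minimal sketch (illustrative only, not the upstream diff; the helper name is made up and the multiplication overflow is ignored) of why the truncating division breaks and what a full 64-bit division looks like; the crash dump that follows shows the failure in practice.

	#include <linux/math64.h>

	static u64 scale_utime_sketch(u64 utime, u64 rtime, u64 total)
	{
		u64 temp = rtime * utime;

		if (!total)
			return rtime;

		/*
		 * Buggy pattern: do_div(temp, (u32)total) silently truncates the
		 * divisor, so a 64-bit 'total' whose low 32 bits are zero becomes 0.
		 */
		return div64_u64(temp, total);	/* full 64/64 division, no truncation */
	}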
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use. 
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_rq(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If a user launches an RT FIFO task with a priority lower than 49 (that of the softirq RT tasks), it's possible there are two RT FIFO tasks enqueued on one cpu's runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which was just launched is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of the current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson<[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. 
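For reference, the kind of setup that test describes can be reproduced roughly like this (a hedged sketch, not the author's actual test code): a PTHREAD_PRIO_INHERIT mutex so a SCHED_OTHER holder gets PI-boosted while a SCHED_FIFO thread contends, plus an RLIMIT_RTTIME cap that delivers SIGXCPU.

	#include <pthread.h>
	#include <sys/resource.h>

	static pthread_mutex_t lock;

	static void setup_pi_test(void)
	{
		pthread_mutexattr_t attr;
		/* RLIMIT_RTTIME is in microseconds: allow 500 ms of RT CPU time */
		struct rlimit rl = { .rlim_cur = 500000, .rlim_max = 500000 };

		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
		pthread_mutex_init(&lock, &attr);
		pthread_mutexattr_destroy(&attr);

		setrlimit(RLIMIT_RTTIME, &rl);
	}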
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_slice() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. It's not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on. 
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks (with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. 
Do the same thing for root_domain's other cpumask memembers: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to: - A time limit of 2 ms, - need_resched() being set on current task; When one of this condition is met, we wakeup ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In following bench, 200 antagonist "netperf -t TCP_RR" are started in background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_irq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date. 
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
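To make the two softirq changes quoted above easier to follow, here is a minimal user-space sketch (not the kernel implementation) of the bail-out policy they describe: keep processing pending softirqs, but hand off to ksoftirqd once a roughly 2 ms budget is spent, a reschedule is requested, or MAX_SOFTIRQ_RESTART rounds have run. The extern helpers are assumed stand-ins for the kernel primitives.

```c
/*
 * User-space model (NOT the kernel code) of the softirq bail-out policy
 * described above: bounded time budget, need_resched() check, and a
 * restart counter, with leftover work deferred to ksoftirqd.
 */
#include <stdbool.h>
#include <time.h>

#define MAX_SOFTIRQ_RESTART 10

extern bool softirq_pending(void);          /* stand-in: any softirq raised?   */
extern void handle_pending_softirqs(void);  /* stand-in: run softirq handlers  */
extern bool need_resched(void);             /* stand-in: reschedule requested? */
extern void wakeup_ksoftirqd(void);         /* stand-in: defer to ksoftirqd    */

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void do_softirq_model(void)
{
    long long deadline = now_ns() + 2 * 1000000LL;  /* ~2 ms time budget */
    int max_restart = MAX_SOFTIRQ_RESTART;

    do {
        handle_pending_softirqs();
    } while (softirq_pending() &&
             now_ns() < deadline &&
             !need_resched() &&
             --max_restart > 0);

    /* Whatever is still pending is handed to the ksoftirqd thread. */
    if (softirq_pending())
        wakeup_ksoftirqd();
}
```

In the kernel the time budget is measured in jiffies, which can stop advancing in the stop_machine scenario described above; the restart counter always makes progress, which is why re-adding it breaks the lockup.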
alesaiko pushed a commit that referenced this issue Mar 14, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger nohz idle balancer on that cpu for unnecessarily long time (introducing additional scheduling latencies for tasks). Fix bug in scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a a per-cpu hrtimer resource and hence alarming that hrtimer should be based on total SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, its possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rq on same cpu. Fix this by alarming sched hrtimer based on total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged. 
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going un-noticed because there was an earlier instance that occured long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems. 
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarentee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. 
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. one jiffy and then go stale. forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's priority, while rt_rq passed to them may be not the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no one other task has the same priority) are waking on a throttle rt_rq. The rq's cpupri is set to the task's priority equivalent, but real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in earlty use. 
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson<[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. 
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_sliced() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on. 
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer was behaving oddly I found that the rq->cpu_load[] tables were completely screwy: a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc.), and this means that especially the higher indices were significantly lower than they ought to be. The reason for this is that it's not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch kludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because tasks (with all the cpus allowed) belonging to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues.
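For reference, the per-index decay that update_cpu_load() applied each tick in kernels of this era is, to the best of my understanding, load[i] = (load[i] * (2^i - 1) + cur) >> i. The standalone sketch below is userspace code, not the kernel function; it feeds two "catch-up" ticks that are wrongly assumed idle (cur = 0) into an otherwise busy history, showing how the mid and high indices get dragged below their true value, which is the symptom described in the cpu_load[] fix above.

```c
/* Illustrative sketch (not kernel code) of the cpu_load[] decay and of
 * how catch-up ticks treated as idle depress the longer-term indices. */
#include <stdio.h>

#define IDX 5

static void tick(unsigned long load[IDX], unsigned long cur)
{
	for (int i = 0; i < IDX; i++) {
		unsigned long scale = 1UL << i;
		/* assumed formula: load[i] = (old * (2^i - 1) + cur) / 2^i */
		load[i] = (load[i] * (scale - 1) + cur) >> i;
	}
}

int main(void)
{
	unsigned long load[IDX] = { 0 };

	for (int t = 0; t < 100; t++)
		tick(load, 1024);		/* steady busy load */
	printf("busy:    %lu %lu %lu %lu %lu\n",
	       load[0], load[1], load[2], load[3], load[4]);

	tick(load, 0);				/* catch-up tick #1, assumed idle */
	tick(load, 0);				/* catch-up tick #2, assumed idle */
	printf("catchup: %lu %lu %lu %lu %lu\n",
	       load[0], load[1], load[2], load[3], load[4]);
	return 0;
}
```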
Do the same thing for root_domain's other cpumask members: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to: - A time limit of 2 ms, - need_resched() being set on current task; When one of these conditions is met, we wake up ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In the following bench, 200 antagonist "netperf -t TCP_RR" instances are started in the background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_softirq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date.
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
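As a userspace analogue of the ksoftirqd fallback condition described in "softirq: Reduce latencies" above (bail out after a 2 ms budget or when a reschedule is pending), the sketch below shows the shape of that logic; the work loop and flag names are illustrative, not the kernel code.

```c
/* Userspace sketch of the bail-out condition: keep processing work,
 * but hand the rest off to a deferred worker once 2 ms are spent or a
 * "reschedule requested" flag is set. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define BUDGET_NS (2 * 1000 * 1000)	/* 2 ms, as in the patch */

static bool need_resched_flag;		/* stands in for need_resched() */

static long long now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static bool process_one_item(void) { return true; /* pretend there is always work */ }

int main(void)
{
	long long start = now_ns();
	unsigned long handled = 0;

	while (process_one_item()) {
		handled++;
		if (now_ns() - start > BUDGET_NS || need_resched_flag) {
			printf("handled %lu items, deferring the rest\n", handled);
			break;	/* the kernel would wake ksoftirqd here */
		}
	}
	return 0;
}
```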
alesaiko pushed a commit that referenced this issue Mar 26, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger the nohz idle balancer on that cpu for an unnecessarily long time (introducing additional scheduling latencies for tasks). Fix bug in scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a per-cpu hrtimer resource, and hence arming that hrtimer should be based on the total number of SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than on the number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, it's possible for a running task (which is the sole task in its cfs_rq) to be preempted long after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rqs on the same cpu. Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark io_schedule_timeout() with EXPORT_SYMBOL Make io_schedule_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged.
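A minimal sketch of flipping the sched_wake_to_idle sysctl introduced above from C. The /proc path is inferred from the sysctl name and may differ on a given kernel, so treat it as an assumption.

```c
/* Sketch: enable the wake-to-idle behaviour via the sysctl described
 * above.  The path below is assumed from the sysctl name. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_wake_to_idle", "w");

	if (!f) {
		perror("sched_wake_to_idle");
		return 1;
	}
	fputs("1\n", f);	/* nonzero: wake all tasks on an idle CPU if available */
	fclose(f);
	return 0;
}
```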
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
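The "Make sure to not re-read variables after validation" change above is an instance of a general rule: validate a snapshot, not the live shared variable. A generic, non-kernel sketch of the hazard and the fix; the shared_avg variable and the helper names are hypothetical, and the fragment is meant to be compiled as a unit (gcc -c), not run on its own:

```c
/* Re-read hazard: a concurrent update between the check and the use can
 * invalidate the check, producing the "unintended negative" (huge
 * unsigned wrap) mentioned above. */
#include <stdatomic.h>

static _Atomic unsigned long shared_avg;	/* hypothetical shared value */

unsigned long headroom_buggy(unsigned long total)
{
	if (atomic_load(&shared_avg) < total)		  /* check once...      */
		return total - atomic_load(&shared_avg); /* ...then re-read: may wrap */
	return 0;
}

unsigned long headroom_fixed(unsigned long total)
{
	unsigned long avg = atomic_load(&shared_avg);	/* read exactly once */

	if (avg < total)
		return total - avg;
	return 0;
}
```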
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometimes not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part of the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding the rq lock, crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit, it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going unnoticed because there was an earlier instance that occurred long ago, change to WARN_ON(). If there ever is a flood of these, there are bigger problems.
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for the need to notify the cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to an invalid memory reference. Fix this by running the test for notification with the task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of both source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH from the scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra: checking the rq length of the target vcpu condition added. (thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain.
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule(), the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is a 64 bit type, it is possible to trigger a divide by zero on the do_div(temp, (__force u32) total) line, if total is a non-zero number but has its lower 32 bits zeroed. Removing the casting is not a good solution since some do_div() implementations do cast to u32 internally.
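A tiny userspace arithmetic sketch of the failure mode just described (it reproduces the truncation, not the kernel's do_div() itself): a nonzero 64-bit total whose low 32 bits are all zero becomes a zero divisor after the cast to u32.

```c
/* Userspace sketch: how a nonzero 64-bit total truncates to a zero
 * divisor when cast to u32, which is what triggers the trap above. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t total = 0x0000000100000000ULL;	/* nonzero, low 32 bits zero */
	uint64_t temp  = 123456789ULL;
	uint32_t divisor = (uint32_t)total;	/* what the (__force u32) cast does */

	printf("total = %llu, divisor after cast = %u\n",
	       (unsigned long long)total, divisor);
	if (divisor == 0) {
		puts("would divide by zero - this is the bug");
		return 1;
	}
	printf("temp / divisor = %llu\n", (unsigned long long)(temp / divisor));
	return 0;
}
```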
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
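The torn read described above can be shown deterministically in a few lines of userspace C by simulating what a 32-bit reader does: read the low half, let a 64-bit update cross the 32-bit boundary, then read the high half. The values are illustrative; with nanosecond clocks the mismatch comes out at roughly 2^32 ns, the "~4 seconds ahead" mentioned above.

```c
/* Deterministic sketch of a torn 64-bit read on a 32-bit reader. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t clock = 0x00000000fffffff0ULL;	/* just below a 32-bit boundary */

	uint32_t lo = (uint32_t)clock;		/* reader: low half first        */
	clock += 0x20;				/* writer: update crosses boundary */
	uint32_t hi = (uint32_t)(clock >> 32);	/* reader: high half afterwards  */

	uint64_t torn = ((uint64_t)hi << 32) | lo;
	printf("real clock: %llu ns\n", (unsigned long long)clock);
	printf("torn read : %llu ns (~%llu s ahead)\n",
	       (unsigned long long)torn,
	       (unsigned long long)((torn - clock) / 1000000000ULL));
	return 0;
}
```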
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing the priority at a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use.
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by an IPI interrupt. Once the target cpu responds to the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick by remembering the jiffies value at which the timeout was last updated. As long as the RT task's jiffies is different from the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevent futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied.
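A simplified sketch of the kind of test described above, without the PI mutex: set an RLIMIT_RTTIME budget, switch to SCHED_FIFO and spin without blocking, and the kernel posts SIGXCPU once the RT runtime exceeds the soft limit. The limit values are arbitrary and the program needs RT privileges (CAP_SYS_NICE or an rtprio rlimit) to run.

```c
/* Sketch: provoke SIGXCPU via RLIMIT_RTTIME for a spinning SCHED_FIFO task. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

static void on_xcpu(int sig)
{
	(void)sig;
	write(2, "SIGXCPU: RT runtime limit hit\n", 30);
	_exit(0);
}

int main(void)
{
	struct rlimit rl = { .rlim_cur = 100000, .rlim_max = 200000 }; /* usec */
	struct sched_param sp = { .sched_priority = 10 };

	signal(SIGXCPU, on_xcpu);
	if (setrlimit(RLIMIT_RTTIME, &rl) || sched_setscheduler(0, SCHED_FIFO, &sp)) {
		perror("setup");
		return 1;
	}
	for (;;)
		;	/* spin without blocking; the RT timeout counter keeps growing */
}
```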
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
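The priority-ceiling scenario in "Queue RT tasks to head when prio drops" above can be reproduced from userspace with the same syscall sequence the commit quotes; the sketch below raises the calling thread to prio 90 and then drops it back to 80. The priorities are the ones from the example, the rest is illustrative, and it needs RT privileges.

```c
/* Sketch of the sched_setscheduler() ceiling sequence from the scenario above. */
#include <sched.h>
#include <stdio.h>

static int set_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = prio };
	return sched_setscheduler(0, SCHED_FIFO, &sp);
}

int main(void)
{
	if (set_fifo(80) || set_fifo(90)) {	/* base prio, then the "ceiling" */
		perror("sched_setscheduler");
		return 1;
	}
	/* ... critical section while boosted ... */
	if (set_fifo(80))			/* drop back; with the fix we re-queue at head */
		perror("sched_setscheduler");
	printf("policy=%d, prio 80 restored\n", sched_getscheduler(0));
	return 0;
}
```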
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rakib and Paul reported two different issues related to the same few lines of code. Rakib's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rakib please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and it's using possible, let's not bother at all. The problem Rakib sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E.
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_sliced() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on. 
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny things I found that the rq->cpu_load[] tables were completely screwy. A bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc.), which means that especially the higher indices were significantly lower than they ought to be. The reason for this is that it's not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch kludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because tasks (with all the cpus allowed) belonging to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. Do the same thing for root_domain's other cpumask members: span and online. Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>

softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback-to-ksoftirqd condition to: a time limit of 2 ms, or need_resched() being set on the current task. When one of these conditions is met, we wake up ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In the following bench, 200 antagonist "netperf -t TCP_RR" instances are started in the background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>

softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_softirq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date.
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
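Taken together, the two softirq patches above leave the softirq loop with a combined bail-out policy: a time budget, a need_resched() check, and a restart cap. The kernel-style sketch below only illustrates that policy and is not the actual __do_softirq(); handle_pending_softirqs() is a hypothetical stand-in for the real per-vector handler loop.

    /* Illustrative sketch of the bail-out policy described above.
     * handle_pending_softirqs() is a hypothetical stand-in for the real
     * handler loop; the rest are existing kernel helpers. */
    #define MAX_SOFTIRQ_TIME    msecs_to_jiffies(2)
    #define MAX_SOFTIRQ_RESTART 10

    static void softirq_loop_sketch(void)
    {
            unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
            int max_restart = MAX_SOFTIRQ_RESTART;
            u32 pending;

    restart:
            pending = local_softirq_pending();
            set_softirq_pending(0);
            handle_pending_softirqs(pending);       /* run the handlers */

            pending = local_softirq_pending();
            if (pending) {
                    /* Re-run only while under the 2 ms budget, while nobody
                     * needs the CPU, and while restarts remain; the restart
                     * cap also covers the case where jiffies itself stalls. */
                    if (time_before(jiffies, end) && !need_resched() &&
                        --max_restart)
                            goto restart;

                    wakeup_softirqd();              /* defer the rest */
            }
    }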
alesaiko
pushed a commit
that referenced
this issue
Apr 22, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger nohz idle balancer on that cpu for unnecessarily long time (introducing additional scheduling latencies for tasks). Fix bug in scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a a per-cpu hrtimer resource and hence alarming that hrtimer should be based on total SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, its possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rq on same cpu. Fix this by alarming sched hrtimer based on total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged. 
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going un-noticed because there was an earlier instance that occured long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems. 
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>

sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is a 64-bit type, it is possible to trigger a divide by zero on the do_div(temp, (__force u32) total) line, if total is a non-zero number but has its lower 32 bits zeroed. Removing the casting is not a good solution since some do_div() implementations do cast to u32 internally.
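A minimal user-space illustration of that failure mode follows; it is not the kernel's code, it just mimics the u32 truncation that do_div() performs on the divisor. In the kernel the usual way out is a genuine 64-bit division helper such as div64_u64(); the demo below simply skips the division when the truncated divisor is zero.

    /* Demonstrates the arithmetic hazard described above: a non-zero 64-bit
     * total whose low 32 bits are zero truncates to a zero divisor. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t total = 0xbc3572c9ULL << 32;   /* non-zero, low 32 bits zero */
            uint64_t temp = 123456789;
            uint32_t divisor = (uint32_t)total;     /* what the u32 cast yields: 0 */

            printf("total = %llu, (u32)total = %u\n",
                   (unsigned long long)total, divisor);

            if (divisor)
                    printf("temp / divisor = %llu\n",
                           (unsigned long long)(temp / divisor));
            else
                    printf("division skipped: a 64-bit division (e.g. div64_u64) is needed\n");
            return 0;
    }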
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
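The readout race shown in the table above tears a 64-bit value on 32-bit machines. Below is a hedged sketch of the kind of atomic remote readout that closes the window; it is simplified and not the exact kernel code, with sched_clock_data standing for the per-CPU clock structure discussed here.

    /* Hedged sketch (not the exact kernel code) of an atomic readout of
     * another CPU's 64-bit clock value on 32-bit machines. */
    static u64 read_remote_clock_sketch(struct sched_clock_data *scd)
    {
    #if BITS_PER_LONG != 64
            /* A plain load would be two 32-bit loads and can tear against a
             * concurrent cmpxchg64() update; cmpxchg64() with identical old
             * and new values returns the current value atomically without
             * modifying it. */
            return cmpxchg64(&scd->clock, 0, 0);
    #else
            return scd->clock;              /* 64-bit loads do not tear here */
    #endif
    }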
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. A short example: the task having the highest priority among all the rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use.
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds to the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different from the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424

sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevent futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]>

sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied.
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
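As an illustration of the ordering requirement described above, here is a hedged kernel-style sketch, simplified and not the exact __schedule() code, of where the new barrier sits relative to the signal check:

    /* Hedged sketch (not the exact kernel code). The sleeper does:
     *
     *     __set_current_state(TASK_INTERRUPTIBLE);
     *     if (condition) ...          // wait-for-condition pattern
     *     schedule();
     *
     * and inside the scheduler the store to prev->state must not be
     * reordered into the rq->lock section past the signal check: */
    smp_mb__before_spinlock();          /* full barrier before taking rq->lock */
    raw_spin_lock_irq(&rq->lock);
    if (signal_pending_state(prev->state, prev))
            prev->state = TASK_RUNNING; /* a signal arrived; don't go to sleep */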
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]>

sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7

sched: Fix the broken sched_rr_get_interval() The caller of sched_slice() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. It's not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on.
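The cpuset/suspend commit introduced above is described in full earlier in this thread; a hedged sketch of the policy it describes follows. The wrapper function and the update_cpusets_and_rebuild_domains() helper are hypothetical, simplified glue; partition_sched_domains() is the real kernel entry point for falling back to a single sched domain.

    /* Hedged sketch of the suspend/resume hotplug policy described above. */
    static void handle_hotplug_domains(bool tasks_frozen)
    {
            if (tasks_frozen) {
                    /* Suspend/resume path: leave every cpuset untouched and
                     * just fall back to one sched domain spanning all active
                     * CPUs; userspace is frozen, so this goes unnoticed. */
                    partition_sched_domains(1, NULL, NULL);
            } else {
                    /* Regular hotplug: walk the cpusets, update their
                     * cpus_allowed masks, and rebuild the sched domains from
                     * them (hypothetical stand-in for the real handling). */
                    update_cpusets_and_rebuild_domains();
            }
    }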
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Change-Id: Iac6eda1ded5074afb261ccc0fe8a4e70432bb031 Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/nohz: Fix rq->cpu_load calculations some more commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Venkatesh Pallipadi <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/core: Clear the root_domain cpumasks in init_rootdomain() root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks (with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. 
Do the same thing for root_domain's other cpumask memembers: - span, - online; Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> softirq: Reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to: - A time limit of 2 ms, - need_resched() being set on current task; When one of this condition is met, we wakeup ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude): In following bench, 200 antagonist "netperf -t TCP_RR" are started in background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard + soft). Before patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch: RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <[email protected]> Cc: David Miller <[email protected]> Cc: Tom Herbert <[email protected]> Cc: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]> softirq: Fix lockup related to stop_machine being stuck in __do_softirq. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_irq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date. 
The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Change-Id: Ib213d5a5c59c85ae16afde6f42a01e39d2c93fd7 Signed-off-by: Ben Greear <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Pekka Riikonen <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang <[email protected]> Signed-off-by: Zefan Li <[email protected]>
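Taken together, the two softirq patches above ("softirq: Reduce latencies" and the stop_machine lockup fix) describe three exit conditions for the dispatcher loop: a ~2 ms time budget, need_resched() on the current task, and a re-added max_restart cap so a stalled jiffies counter cannot keep the loop alive forever. The sketch below is a minimal user-space analogue of that control flow, not the kernel's __do_softirq(); pending_work(), handle_one(), wake_ksoftirqd() and the batch size of 32 are illustrative stand-ins.

```c
/*
 * Minimal user-space analogue of the softirq exit conditions described above:
 * a ~2 ms time budget, need_resched(), and a max_restart cap. All helpers are
 * illustrative stand-ins, not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MAX_RESTART 10                       /* cap restarts; works even if a jiffies-like clock stalls */
static const long long BUDGET_NS = 2000000;  /* ~2 ms time limit from the patch description */

static long long now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int pending = 100000;                   /* simulated backlog of deferred work */
static bool pending_work(void)   { return pending > 0; }
static void handle_one(void)     { pending--; }
static bool need_resched(void)   { return false; }  /* nobody is waiting for the CPU in this toy */
static void wake_ksoftirqd(void) { printf("deferring %d items to ksoftirqd\n", pending); }

int main(void)
{
	long long start = now_ns();
	int max_restart = MAX_RESTART;

restart:
	/* Process one batch of pending work. */
	for (int i = 0; i < 32 && pending_work(); i++)
		handle_one();

	if (pending_work()) {
		/* Keep going only while all three conditions hold: within the
		 * time budget, nobody needs the CPU, and restarts remain. */
		if (now_ns() - start < BUDGET_NS && !need_resched() && --max_restart)
			goto restart;
		wake_ksoftirqd();
	}
	return 0;
}
```

The design point, as the commit texts put it, is that no single condition is enough on its own: need_resched() alone can keep BH disabled long enough to trigger RCU stalls, and a jiffies-based timeout alone can never fire when jiffies stops ticking during stop_machine.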
alesaiko
pushed a commit
that referenced
this issue
Apr 24, 2018
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior of waking their tasks up on idle CPUs. The feature has too much of a negative impact on other workloads however to apply globally. The PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this flag set, or tasks woken by tasks with this flag set, on an idle CPU if one is available. Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039 Signed-off-by: Steve Muckle <[email protected]> sched: add sysctl for controlling task migrations on wake The PF_WAKE_UP_IDLE per-task flag made it impossible to enable the old behavior of SD_SHARE_PKG_RESOURCES, where every task migrates to an idle CPU on wakeup. The sched_wake_to_idle sysctl value, when made nonzero, will cause all tasks to migrate to an idle CPU if one is available when the task is woken up. This is regardless of how PF_WAKE_UP_IDLE is configured for tasks in the system. Similar to PF_WAKE_UP_IDLE, the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled for the sysctl value to have an effect. Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61 Signed-off-by: Steve Muckle <[email protected]> (cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282) Conflicts: kernel/sched/fair.c sched: Reset rq->next_interval before going idle next_balance, the point in jiffy time scale when a cpu will next load balance, could have been calculated when the cpu was busy. A busy cpu will apply its sched domain's busy_factor (usually > 1) in computing next_balance for that sched domain, which causes the (busy) cpu to load balance less frequently in its sched domains. However when the same cpu is going idle, its next_balance needs to be reset without consideration of busy_factor. Failure to do so would not trigger nohz idle balancer on that cpu for unnecessarily long time (introducing additional scheduling latencies for tasks). Fix bug in scheduler which aims to reset next_balance before a cpu goes idle (as per existing comment) but is clearly not doing so. Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Fix SCHED_HRTICK bug leading to late preemption of tasks SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot (just when they would have exceeded their ideal_runtime). It makes use of a a per-cpu hrtimer resource and hence alarming that hrtimer should be based on total SCHED_FAIR tasks a cpu has across its various cfs_rqs, rather than being based on number of tasks in a particular cfs_rq (as implemented currently). As a result, with current code, its possible for a running task (which is the sole task in its cfs_rq) to be preempted much after its ideal_runtime has elapsed, resulting in increased latency for tasks in other cfs_rq on same cpu. Fix this by alarming sched hrtimer based on total number of SCHED_FAIR tasks a CPU has across its various cfs_rqs. Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Mark schedule_io_timeout() with EXPORT_SYMBOL Make schedule_io_timeout() visible to modules. Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2 Signed-off-by: Jordan Crouse <[email protected]> sched: fix reference to wrong cfs_rq Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption of tasks) introduced a bug in sched_slice() calculation by using wrong cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather than the correct one to which they belonged. 
Fix the bug by using correct cfs_rq for tasks. Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <[email protected]> CRs-Fixed: 497236 Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit fef154fa32c5d8ef992f11e11919495979970678) Change-Id: Ieac7a10d9f4ceb701660f8055efd7a0ed5d939ca Signed-off-by: Sridhar Nelloru <[email protected]> sched: re-calculate a cpu's next_balance point upon sched domain changes Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset a cpu's rq->next_balance when pulled_task = 0, which will be true when the cpu failed to pull any task, causing it go idle. However that patch relied on next_balance being calculated as a result of traversing cpu's sched domain hierarchy. A cpu that is the only online cpu will however not be attached to any sched domain hierarchy. When such a cpu calls into idle_balance(), we will end up initializing next_balance to be 1sec away! Such a CPU will defer load balance check for another 1sec, even though we may bring up more cpus in the meantime requiring it to check for load imbalance more frequently. This could then lead to increased scheduling latency for some tasks. This patch results in a cpu's next_balance being re-calculated when its attaching to a new sched domain hierarchy. This should let cpus call load balance checks at the right time we expect them to! Signed-off-by: Srivatsa Vaddagiri <[email protected]> (cherry picked from commit 91b609bdd91513c37c410ffef0204f8a0578e4f8) Change-Id: Ib457f1a357aad26e49837a01311b73e22e0cc26f Signed-off-by: Sridhar Nelloru <[email protected]> sched: Unthrottle rt runqueues in __disable_runtime() migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. 
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix clear NOHZ_BALANCE_KICK I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 873b4c65b519fd769940eb281f77848227d4e5c1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: minor merge resolution for 3.4 in scheduler_ipi()] Signed-off-by: Steve Muckle <[email protected]> Change-Id: I3548612057cccc2ecc29429c129c44183083831f sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Change-Id: I75fdaaf4dcaefcf3893e0404d98c8aa91f89934d Signed-off-by: Tejun Heo <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: 383efcd00053ec40023010ce5034bd702e7ab373 CRs-Fixed: 519942 Signed-off-by: Steve Muckle <[email protected]> sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local() The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local() were recently converted from BUG_ON() calls. If these hit it indicates something is wrong and that may contribute to other system instability. To eliminate the risk of an instance of one of these errors going un-noticed because there was an earlier instance that occured long ago, change to WARN_ON(). If there ever is a flood of these there are bigger problems. 
Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7 Signed-off-by: Steve Muckle <[email protected]> sched: Convert WARN_ON() to printk_sched() in try_to_wake_up_local() try_to_wake_up_local() is called with the rq lock held. Printing to console in this context can result in a deadlock if klogd needs to be woken up. Print to the kernel log buffer via printk_sched() instead which avoids the wakeup. Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c Signed-off-by: Syed Rameez Mustafa <[email protected]> sched/rt: Use root_domain of rt_rq not current processor When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Change-Id: Id8b9f88bdd783ef87bfa8d288a2fe7e86dd0b1e2 Signed-off-by: Shawn Bohrer <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 89960feebaf4f9a53f93a0ce6888207e4a808799 Git-repo: git://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git Signed-off-by: Syed Rameez Mustafa <[email protected]> sched: Remove redundant update_runtime notifier migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8 Signed-off-by: Neil Zhang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [[email protected]: resolved trivial header file context conflict] Signed-off-by: Matt Wagantall <[email protected]> sched: Fix hotplug vs. set_cpus_allowed_ptr() Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. CRs-Fixed: 680496 Change-Id: I89ac9b6829acf200072975bc7d028a469167f083 Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. 
Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Git-commit: 24d52daafcd53f109893ec2faf668c8eeef4a382 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <[email protected]> sched: Fix race between try_to_wake_up() and move_task() Until a task's state has been seen as interruptible/uninterruptible and it is no longer on_cpu, it is possible that the task may move to another CPU (load balancing may cause this). Here is an example where the race condition results in incorrect operation: - cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING - cpu 0 runs task B, which attempts to wake up A - cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0 - cpu 1 then pulls task A (perhaps due to idle balance) - cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE - cpu 0 continues in try_to_wake_up(), thinking task A's previous cpu is 0, where it is actually 1 - if select_task_rq returns cpu 0, task A will be woken up on cpu 0 without properly updating its cpu to 0 in set_task_cpu() CRs-Fixed: 665958 Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99 Signed-off-by: Steve Muckle <[email protected]> sched: Fix reference to stale task_struct in try_to_wake_up() try_to_wake_up() currently drops p->pi_lock and later checks for need to notify cpufreq governor on task migrations or wakeups. However the woken task could exit between the time p->pi_lock is released and the time the test for notification is run. As a result, the test for notification could refer to an exited task. task_notify_on_migrate(p) could thus lead to invalid memory reference. Fix this by running the test for notification with task's pi_lock held. Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1 Signed-off-by: Srivatsa Vaddagiri <[email protected]> sched: Bail out of yield_to when source and target runqueue has one task In case of undercomitted scenarios, especially in large guests yield_to overhead is significantly high. when run queue length of source and target is one, take an opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of PLE handler. (History: Raghavendra initially worked on break out of kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of return -ESRCH in scheduler core. Signed-off-by: Peter Zijlstra <[email protected]> Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi) Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Raghavendra K T <[email protected]> Acked-by: Andrew Jones <[email protected]> Tested-by: Chegu Vinod <[email protected]> Signed-off-by: Gleb Natapov <[email protected]> Change-Id: I4eca8afc683c1a4b9d2490422baacd3da76c1e92 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs Signed-off-by: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix double normalization of vruntime dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarentee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Change-Id: I70aa0fe0e9be146b1a2a903d010b318390262dc3 Signed-off-by: George McCollister <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix divide by zero at {thread_group,task}_times On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. 
This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Change-Id: I1e0fa1fa38b86566f5b20e5c0efa0c9eb17b203b Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched_clock: Prevent 64bit inatomicity on 32bit systems The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. 
The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. Possibility 1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. Possibility 2 can be excluded because it would not make the clock jump forward; it would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Change-Id: I767a6a645e17651b39d0a530dc0681756f3b4f5d Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's priority, while the rt_rq passed to them may not be the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all of the rq's RT tasks (no other task has the same priority) is waking on a throttled rt_rq. The rq's cpupri is set to the task's priority equivalent, but the real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Change-Id: I01d3cc07870b1825efa1deed3b70f060ac20939f Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> CC: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Guarantee new group-entities always have weight Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use.
( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Change-Id: I1b09c5c4f85df1de565e7c862ba8dde65851c05f Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]> Cc: Chris J Arges <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Fix isolated CPUs leaving root_task_group indefinitely throttled Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Change-Id: I8485724e3ff0960008aa13634e55bcb6ceacbc01 Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: rt: Avoid updating RT entry timeout twice within one tick period The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. 
If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Fan Du <[email protected]> Reviewed-by: Yong Zhang <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: I1d126959d88f3bc1adb4a707661efb7162048424 sched: rt: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Change-Id: I232abe50623593d51be508fc64e8b865dcc77212 Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Shawn Bohrer <[email protected]> Cc: Suruchi Kadu <[email protected]> Cc: Doug Nelson<[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix RLIMIT_RTTIME when PI-boosting to RT When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. 
It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Change-Id: I732036119517d9a9625cea4644b4509c1d6cdf13 Signed-off-by: Brian Silverman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> [lizf: Backported to 3.4: adjust contest] Signed-off-by: Zefan Li <[email protected]> sched: Queue RT tasks to head when prio drops The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Change-Id: I65f159b3c2d265dcaf27b70bf9be88cf922ab845 Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Zefan Li <[email protected]> sched: Fix the theoretical signal_wake_up() vs schedule() race This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like: __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if (signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. 
Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Change-Id: I2b329b2ae213cc9221811ca244d2ebb044011a1b Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in idle_balance() The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8 Signed-off-by: Daniel Lezcano <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs cpu-hotplug Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: Rakib Mullick <[email protected]> Reported-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched: Fix load avg vs. cpu-hotplug Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Change-Id: I24063eebf4429d7e01838af2cb68e0f3c2cf2131 Suggested-by: Paul E. 
McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins Signed-off-by: Ingo Molnar <[email protected]> sched/nohz: Rewrite and fix load-avg computation -- again Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <[email protected]> Reported-and-tested-by: Charles Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Change-Id: Ib735002c79eaace9fa5f7795fdc2f80b638b56a7 sched: Fix the broken sched_rr_get_interval() The caller of sched_sliced() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Turner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched: Fix race in task_group() Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"). He found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Change-Id: I6906c92ea6f66d19af5938a9f5ec9f1ff4577c39 Reported-and-tested-by: Stefan Bader <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on. 
And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Change-Id: I12c3e6289d5aa069531f8f616d63412b38103d99 Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Preeti U Murthy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? 
wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
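The "sched: Fix divide by zero at {thread_group,task}_times" message above boils down to an arithmetic hazard: a 64-bit total can be non-zero while its low 32 bits are all zero, so truncating it to u32 before dividing (as the do_div() call path effectively does) divides by zero. Below is a small, hedged user-space illustration of that hazard; it is plain C arithmetic with made-up values, not the kernel code.

```c
/*
 * Illustration of the truncation hazard: a non-zero 64-bit value whose low
 * 32 bits are zero becomes 0 when cast to u32, so dividing by it faults.
 * Plain user-space arithmetic, not the kernel's thread_group_times().
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t total  = 0x100000000ULL;   /* non-zero, but the low 32 bits are zero */
	uint64_t temp   = 123456ULL * 1000; /* value to be scaled, illustrative only  */
	uint32_t as_u32 = (uint32_t)total;  /* what a u32 divisor path would see      */

	printf("total=%llu truncated=%u\n", (unsigned long long)total, as_u32);

	if (as_u32 == 0)
		puts("unfixed path would divide by zero; keep the divide 64-bit instead");
	else
		printf("scaled=%llu\n", (unsigned long long)(temp / as_u32));

	return 0;
}
```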
alesaiko pushed a commit that referenced this issue Sep 30, 2018
ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980f "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/[email protected] Fixes: cd0980f "tracing: Check invalid syscall nr while tracing syscalls" Cc: [email protected] # 2.6.33+ Signed-off-by: Rabin Vincent <[email protected]> Signed-off-by: Steven Rostedt <[email protected]> Change-Id: I512142f8f1e1b2a8dc063209666dbce9737377e7
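The fix described above is, in essence, a range check in the syscall tracepoint handlers. The sketch below is illustrative (reconstructed from the commit text, using the handler name quoted in it and the standard syscall_get_nr() helper), not the exact upstream diff:

    /* sketch: ignore syscall numbers outside [0, NR_syscalls) so ARM private
     * syscalls such as 0xf0005 never index the syscall metadata arrays */
    static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
    {
        int syscall_nr = syscall_get_nr(current, regs);

        if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
            return;

        /* ... look up syscall metadata and record the sys_enter event ... */
    }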
alesaiko pushed a commit that referenced this issue Sep 30, 2018
If a user key gets negatively instantiated, an error code is cached in the payload area. A negatively instantiated key may then be positively instantiated by updating it with valid data. However, the ->update key type method must be aware that the error code may be there.

The following may be used to trigger the bug in the user key type:

keyctl request2 user user "" @U
keyctl add user user "a" @U

which manifests itself as:

BUG: unable to handle kernel paging request at 00000000ffffff8a IP: [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046 PGD 7cc30067 PUD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 3 PID: 2644 Comm: a.out Not tainted 4.3.0+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88003ddea700 ti: ffff88003dd88000 task.ti: ffff88003dd88000 RIP: 0010:[<ffffffff810a376f>] [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046 RSP: 0018:ffff88003dd8bdb0 EFLAGS: 00010246 RAX: 00000000ffffff82 RBX: 0000000000000000 RCX: 0000000000000001 RDX: ffffffff81e3fe40 RSI: 0000000000000000 RDI: 00000000ffffff82 RBP: ffff88003dd8bde0 R08: ffff88007d2d2da0 R09: 0000000000000000 R10: 0000000000000000 R11: ffff88003e8073c0 R12: 00000000ffffff82 R13: ffff88003dd8be68 R14: ffff88007d027600 R15: ffff88003ddea700 FS: 0000000000b92880(0063) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000ffffff8a CR3: 000000007cc5f000 CR4: 00000000000006e0 Stack: ffff88003dd8bdf0 ffffffff81160a8a 0000000000000000 00000000ffffff82 ffff88003dd8be68 ffff88007d027600 ffff88003dd8bdf0 ffffffff810a39e5 ffff88003dd8be20 ffffffff812a31ab ffff88007d027600 ffff88007d027620 Call Trace: [<ffffffff810a39e5>] kfree_call_rcu+0x15/0x20 kernel/rcu/tree.c:3136 [<ffffffff812a31ab>] user_update+0x8b/0xb0 security/keys/user_defined.c:129 [< inline >] __key_update security/keys/key.c:730 [<ffffffff8129e5c1>] key_create_or_update+0x291/0x440 security/keys/key.c:908 [< inline >] SYSC_add_key security/keys/keyctl.c:125 [<ffffffff8129fc21>] SyS_add_key+0x101/0x1e0 security/keys/keyctl.c:60 [<ffffffff8185f617>] entry_SYSCALL_64_fastpath+0x12/0x6a arch/x86/entry/entry_64.S:185

Note the error code (-ENOKEY) in EDX. A similar bug can be tripped by:

keyctl request2 trusted user "" @U
keyctl add trusted user "a" @U

This should also affect encrypted keys - but that has to be correctly parameterised or it will fail with EINVAL before getting to the bit that crashes.

Change-Id: I171d566f431c56208e1fe279f466d2d399a9ac7c Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: David Howells <[email protected]> Acked-by: Mimi Zohar <[email protected]> Signed-off-by: James Morris <[email protected]>
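A hedged sketch of the guard this change calls for in the user key type's ->update path follows. The function name below is made up for illustration, and the field and flag names (payload.data, KEY_FLAG_NEGATIVE) follow the keys API of that era; the point is only that the old "payload" must not be handed to RCU when it is really a cached error code:

    /* illustrative only, not the upstream patch */
    static int user_update_sketch(struct key *key,
                                  struct user_key_payload *new_payload)
    {
        struct user_key_payload *old = NULL;

        /* a negatively instantiated key caches an error code (e.g. -ENOKEY)
         * in the payload slot, so it must not be treated as a pointer */
        if (!test_bit(KEY_FLAG_NEGATIVE, &key->flags))
            old = key->payload.data;

        rcu_assign_keypointer(key, new_payload);
        key->expiry = 0;

        if (old)
            kfree_rcu(old, rcu);    /* never called on an error code */
        return 0;
    }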
alesaiko pushed a commit that referenced this issue Sep 30, 2018
An issue was observed when a userspace task exits. The page which hits the error here is the zero page. In binder mmap, the whole of the vma is not mapped. On a task crash, when debuggerd reads the binder regions, the unmapped areas fall to do_anonymous_page in handle_pte_fault, due to the absence of a vm_fault handler. This results in the zero page being mapped. Later in zap_pte_range, vm_normal_page returns the zero page in the case of VM_MIXEDMAP and it results in the error.

BUG: Bad page map in process mediaserver pte:9dff379f pmd:9bfbd831 page:c0ed8e60 count:1 mapcount:-1 mapping: (null) index:0x0 page flags: 0x404(referenced|reserved) addr:40c3f000 vm_flags:10220051 anon_vma: (null) mapping:d9fe0764 index:fd vma->vm_ops->fault: (null) vma->vm_file->f_op->mmap: binder_mmap+0x0/0x274 CPU: 0 PID: 1463 Comm: mediaserver Tainted: G W 3.10.17+ #1 [<c001549c>] (unwind_backtrace+0x0/0x11c) from [<c001200c>] (show_stack+0x10/0x14) [<c001200c>] (show_stack+0x10/0x14) from [<c0103d78>] (print_bad_pte+0x158/0x190) [<c0103d78>] (print_bad_pte+0x158/0x190) from [<c01055f0>] (unmap_single_vma+0x2e4/0x598) [<c01055f0>] (unmap_single_vma+0x2e4/0x598) from [<c010618c>] (unmap_vmas+0x34/0x50) [<c010618c>] (unmap_vmas+0x34/0x50) from [<c010a9e4>] (exit_mmap+0xc8/0x1e8) [<c010a9e4>] (exit_mmap+0xc8/0x1e8) from [<c00520f0>] (mmput+0x54/0xd0) [<c00520f0>] (mmput+0x54/0xd0) from [<c005972c>] (do_exit+0x360/0x990) [<c005972c>] (do_exit+0x360/0x990) from [<c0059ef0>] (do_group_exit+0x84/0xc0) [<c0059ef0>] (do_group_exit+0x84/0xc0) from [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) from [<c0011500>] (do_signal+0xa8/0x3b8)

Add a vm_fault handler which returns VM_FAULT_SIGBUS, and prevents the wrong fallback to do_anonymous_page.

Signed-off-by: Vinayak Menon <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
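The handler described above is small; a sketch consistent with the commit text (the .open and .close callbacks are binder's existing vma callbacks, shown only to place .fault in context) looks like this:

    /* sketch: fault on any access to a part of the binder mapping that was
     * never populated, instead of letting do_anonymous_page() map the zero
     * page into a VM_MIXEDMAP vma */
    static int binder_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
        return VM_FAULT_SIGBUS;
    }

    static const struct vm_operations_struct binder_vm_ops = {
        .open  = binder_vma_open,
        .close = binder_vma_close,
        .fault = binder_vm_fault,
    };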
alesaiko pushed a commit that referenced this issue Sep 30, 2018
Fix a short sprintf buffer in proc_keys_show(). If the gcc stack protector is turned on, this can cause a panic due to stack corruption. The problem is that xbuf[] is not big enough to hold a 64-bit timeout rendered as weeks: (gdb) p 0xffffffffffffffffULL/(60*60*24*7) $2 = 30500568904943 That's 14 chars plus NUL, not 11 chars plus NUL. Expand the buffer to 16 chars. I think the unpatched code apparently works if the stack-protector is not enabled because on a 32-bit machine the buffer won't be overflowed and on a 64-bit machine there's a 64-bit aligned pointer at one side and an int that isn't checked again on the other side. The panic incurred looks something like: Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81352ebe CPU: 0 PID: 1692 Comm: reproducer Not tainted 4.7.2-201.fc24.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 0000000000000086 00000000fbbd2679 ffff8800a044bc00 ffffffff813d941f ffffffff81a28d58 ffff8800a044bc98 ffff8800a044bc88 ffffffff811b2cb6 ffff880000000010 ffff8800a044bc98 ffff8800a044bc30 00000000fbbd2679 Call Trace: [<ffffffff813d941f>] dump_stack+0x63/0x84 [<ffffffff811b2cb6>] panic+0xde/0x22a [<ffffffff81352ebe>] ? proc_keys_show+0x3ce/0x3d0 [<ffffffff8109f7f9>] __stack_chk_fail+0x19/0x30 [<ffffffff81352ebe>] proc_keys_show+0x3ce/0x3d0 [<ffffffff81350410>] ? key_validate+0x50/0x50 [<ffffffff8134db30>] ? key_default_cmp+0x20/0x20 [<ffffffff8126b31c>] seq_read+0x2cc/0x390 [<ffffffff812b6b12>] proc_reg_read+0x42/0x70 [<ffffffff81244fc7>] __vfs_read+0x37/0x150 [<ffffffff81357020>] ? security_file_permission+0xa0/0xc0 [<ffffffff81246156>] vfs_read+0x96/0x130 [<ffffffff81247635>] SyS_read+0x55/0xc0 [<ffffffff817eb872>] entry_SYSCALL_64_fastpath+0x1a/0xa4 Change-Id: I0787d5a38c730ecb75d3c08f28f0ab36295d59e7 Reported-by: Ondrej Kozina <[email protected]> Signed-off-by: David Howells <[email protected]> Tested-by: Ondrej Kozina <[email protected]>
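As a rough sketch of the sizing argument (variable names follow the commit text; the surrounding proc_keys_show() code and the exact declaration may differ):

    char xbuf[16];    /* worst case below: 14 digits + 'w' + NUL */

    /* a 64-bit timeout rendered as weeks:
     * 0xffffffffffffffffULL / (60*60*24*7) == 30500568904943 (14 digits) */
    sprintf(xbuf, "%luw", timo / (60 * 60 * 24 * 7));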
alesaiko pushed a commit that referenced this issue Sep 30, 2018
There is a race between the TX path and the STA wakeup: while a station is sleeping, mac80211 buffers frames until it wakes up, then the frames are transmitted. However, the RX and TX paths are concurrent, so the packet indicating wakeup can be processed while a packet is being transmitted. This can lead to a situation where the buffered frames list is emptied on the one side, while a frame is being added on the other side, as the station is still seen as sleeping in the TX path. As a result, the newly added frame will not be sent anytime soon. It might be sent much later (and out of order) when the station goes to sleep and wakes up the next time. Additionally, it can lead to the crash below.

Fix all this by synchronising both paths with a new lock. Neither path is a fastpath, since they handle PS situations. In a later patch we'll remove the extra skb queue locks to reduce locking overhead.

BUG: unable to handle kernel NULL pointer dereference at 000000b0 IP: [<ff6f1791>] ieee80211_report_used_skb+0x11/0x3e0 [mac80211] *pde = 00000000 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC EIP: 0060:[<ff6f1791>] EFLAGS: 00210282 CPU: 1 EIP is at ieee80211_report_used_skb+0x11/0x3e0 [mac80211] EAX: e5900da0 EBX: 00000000 ECX: 00000001 EDX: 00000000 ESI: e41d00c0 EDI: e5900da0 EBP: ebe458e4 ESP: ebe458b0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 000000b0 CR3: 25a78000 CR4: 000407d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process iperf (pid: 3934, ti=ebe44000 task=e757c0b0 task.ti=ebe44000) iwlwifi 0000:02:00.0: I iwl_pcie_enqueue_hcmd Sending command LQ_CMD (#4e), seq: 0x0903, 92 bytes at 3[3]:9 Stack: e403b32c ebe458c4 00200002 00200286 e403b338 ebe458cc c10960bb e5900da0 ff76a6ec ebe458d8 00000000 e41d00c0 e5900da0 ebe458f0 ff6f1b75 e403b210 ebe4598c ff723dc1 00000000 ff76a6ec e597c978 e403b758 00000002 00000002 Call Trace: [<ff6f1b75>] ieee80211_free_txskb+0x15/0x20 [mac80211] [<ff723dc1>] invoke_tx_handlers+0x1661/0x1780 [mac80211] [<ff7248a5>] ieee80211_tx+0x75/0x100 [mac80211] [<ff7249bf>] ieee80211_xmit+0x8f/0xc0 [mac80211] [<ff72550e>] ieee80211_subif_start_xmit+0x4fe/0xe20 [mac80211] [<c149ef70>] dev_hard_start_xmit+0x450/0x950 [<c14b9aa9>] sch_direct_xmit+0xa9/0x250 [<c14b9c9b>] __qdisc_run+0x4b/0x150 [<c149f732>] dev_queue_xmit+0x2c2/0xca0

Cc: [email protected] Reported-by: Yaara Rozenblum <[email protected]> Signed-off-by: Emmanuel Grumbach <[email protected]> Reviewed-by: Stanislaw Gruszka <[email protected]> [reword commit log, use a separate lock] Signed-off-by: Johannes Berg <[email protected]> CVE-2014-2706 Change-Id: I555e82d71add385f300ecc20af5d5d8b69246c52 (cherry picked from commit 1d147bfa64293b2723c4fec50922168658e613ba)
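The synchronisation described above follows the pattern sketched here. This is an illustrative reconstruction, not the full upstream patch; the ps_lock name and the sta flag/queue helpers are taken from mac80211 of that era and should be read as an assumption about the final code shape:

    /* TX path: decide between buffering and transmitting atomically with
     * respect to the wakeup path */
    spin_lock(&sta->ps_lock);
    if (test_sta_flag(sta, WLAN_STA_PS_STA)) {
        skb_queue_tail(&sta->ps_tx_buf[ac], skb);   /* station asleep */
        spin_unlock(&sta->ps_lock);
        return TX_QUEUED;
    }
    spin_unlock(&sta->ps_lock);
    /* station awake: transmit normally */

    /* wakeup path: clear the PS state and drain the buffers under the same
     * lock, so a frame can no longer be buffered "behind" the flush */
    spin_lock(&sta->ps_lock);
    clear_sta_flag(sta, WLAN_STA_PS_STA);
    /* ... splice sta->ps_tx_buf[] onto the pending queues ... */
    spin_unlock(&sta->ps_lock);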
alesaiko pushed a commit that referenced this issue Sep 30, 2018
This fixes CVE-2016-8650. If mpi_powm() is given a zero exponent, it wants to immediately return either 1 or 0, depending on the modulus. However, if the result was initialised with zero limb space, no limb space is allocated and a NULL-pointer exception ensues. Fix this by allocating a minimal amount of limb space for the result in the 0-exponent case when the result is 1, and not touching the limb space when the result is 0. This affects the use of RSA keys and X.509 certificates that carry them.

BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6 PGD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 3 PID: 3014 Comm: keyctl Not tainted 4.9.0-rc6-fscache+ #278 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 task: ffff8804011944c0 task.stack: ffff880401294000 RIP: 0010:[<ffffffff8138ce5d>] [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6 RSP: 0018:ffff880401297ad8 EFLAGS: 00010212 RAX: 0000000000000000 RBX: ffff88040868bec0 RCX: ffff88040868bba0 RDX: ffff88040868b260 RSI: ffff88040868bec0 RDI: ffff88040868bee0 RBP: ffff880401297ba8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000047 R11: ffffffff8183b210 R12: 0000000000000000 R13: ffff8804087c7600 R14: 000000000000001f R15: ffff880401297c50 FS: 00007f7a7918c700(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000401250000 CR4: 00000000001406e0 Stack: ffff88040868bec0 0000000000000020 ffff880401297b00 ffffffff81376cd4 0000000000000100 ffff880401297b10 ffffffff81376d12 ffff880401297b30 ffffffff81376f37 0000000000000100 0000000000000000 ffff880401297ba8 Call Trace: [<ffffffff81376cd4>] ? __sg_page_iter_next+0x43/0x66 [<ffffffff81376d12>] ? sg_miter_get_next_page+0x1b/0x5d [<ffffffff81376f37>] ? sg_miter_next+0x17/0xbd [<ffffffff8138ba3a>] ? mpi_read_raw_from_sgl+0xf2/0x146 [<ffffffff8132a95c>] rsa_verify+0x9d/0xee [<ffffffff8132acca>] ? pkcs1pad_sg_set_buf+0x2e/0xbb [<ffffffff8132af40>] pkcs1pad_verify+0xc0/0xe1 [<ffffffff8133cb5e>] public_key_verify_signature+0x1b0/0x228 [<ffffffff8133d974>] x509_check_for_self_signed+0xa1/0xc4 [<ffffffff8133cdde>] x509_cert_parse+0x167/0x1a1 [<ffffffff8133d609>] x509_key_preparse+0x21/0x1a1 [<ffffffff8133c3d7>] asymmetric_key_preparse+0x34/0x61 [<ffffffff812fc9f3>] key_create_or_update+0x145/0x399 [<ffffffff812fe227>] SyS_add_key+0x154/0x19e [<ffffffff81001c2b>] do_syscall_64+0x80/0x191 [<ffffffff816825e4>] entry_SYSCALL64_slow_path+0x25/0x25 Code: 56 41 55 41 54 53 48 81 ec a8 00 00 00 44 8b 71 04 8b 42 04 4c 8b 67 18 45 85 f6 89 45 80 0f 84 b4 06 00 00 85 c0 75 2f 41 ff ce <49> c7 04 24 01 00 00 00 b0 01 75 0b 48 8b 41 18 48 83 38 01 0f RIP [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6 RSP <ffff880401297ad8> CR2: 0000000000000000 ---[ end trace d82015255d4a5d8d ]---

Basically, this is a backport of a libgcrypt patch: http://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=patch;h=6e1adb05d290aeeb1c230c763970695f4a538526

Fixes: cdec9cb ("crypto: GnuPG based MPI lib - source files (part 1)") Change-Id: Ib4ad146d7488281260dade0d8d1a4f54fa7bd6ae Signed-off-by: Andrey Ryabinin <[email protected]> Signed-off-by: David Howells <[email protected]> cc: Dmitry Kasatkin <[email protected]> cc: [email protected] cc: [email protected] Signed-off-by: James Morris <[email protected]>
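The early-exit path being fixed can be sketched as below. This follows the commit description but should be read as illustrative, not the exact backported diff; esize and msize are assumed here to be the limb counts of the exponent and modulus inside mpi_powm(), and enomem/leave are its existing error/exit labels:

    /* exponent is zero: the result is 1, or 0 when the modulus is 1 */
    if (!esize) {
        res->nlimbs = (msize == 1 && mod->d[0] == 1) ? 0 : 1;
        if (res->nlimbs) {
            /* the result may have been created with zero limb space, so
             * allocate one limb before writing to it */
            if (mpi_resize(res, 1) < 0)
                goto enomem;
            res->d[0] = 1;
        }
        res->sign = 0;
        goto leave;
    }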
alesaiko pushed a commit that referenced this issue Sep 30, 2018
The lockdep class of the four n_hdlc buffer-list locks is the same because a single function, n_hdlc_buf_list_init, is used to init all the locks. But since flush_tx_queue takes n_hdlc->tx_buf_list.spinlock and then calls n_hdlc_buf_put which takes n_hdlc->tx_free_buf_list.spinlock, lockdep emits a warning:

============================================= [ INFO: possible recursive locking detected ] 4.3.0-25.g91e30a7-default #1 Not tainted --------------------------------------------- a.out/1248 is trying to acquire lock: (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc] but task is already holding lock: (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&list->spinlock)->rlock); lock(&(&list->spinlock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by a.out/1248: #0: (&tty->ldisc_sem){++++++}, at: [<ffffffff814c9eb0>] tty_ldisc_ref_wait+0x20/0x50 #1: (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc] ... Call Trace: ... [<ffffffff81738fd0>] _raw_spin_lock_irqsave+0x50/0x70 [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc] [<ffffffffa01fdc24>] n_hdlc_tty_ioctl+0x144/0x1d0 [n_hdlc] [<ffffffff814c25c1>] tty_ioctl+0x3f1/0xe40 ...

Fix it by initializing the spin_locks separately. This also removes a redundant memset of freshly kzallocated space.

Change-Id: I32bc83c9e19953672857fe8182107772411d471a Signed-off-by: Jiri Slaby <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
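Concretely, the fix described above replaces the shared init helper with direct spin_lock_init() calls, one per list, so lockdep assigns each lock its own class (spin_lock_init() is a macro that creates a static lock-class key per call site). A sketch using the field names from the lockdep report, with the two rx lists assumed to complete the four locks mentioned:

    spin_lock_init(&n_hdlc->tx_free_buf_list.spinlock);
    spin_lock_init(&n_hdlc->rx_free_buf_list.spinlock);
    spin_lock_init(&n_hdlc->tx_buf_list.spinlock);
    spin_lock_init(&n_hdlc->rx_buf_list.spinlock);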
Error 7 when installing on flo + LineageOS. The latest version, 0628, works fine.