提交 · 432bc30040120d508fd45dd0d140a8345a8132cc · openanolis / cloud-kernel

18 3月, 2020 40 次提交

md: no longer compare spare disk superblock events in super_load · 432bc300

由 Yufen Yu 提交于 10月 16, 2019

commit 6a5cb53aaa4ef515ddeffa04ce18b771121127b4 upstream.

We have a test case as follow:

  mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
	--assume-clean --bitmap=internal
  mdadm -S /dev/md1
  mdadm -A /dev/md1 /dev/sd[b-c] --run --force

  mdadm --zero /dev/sda
  mdadm /dev/md1 -a /dev/sda

  echo offline > /sys/block/sdc/device/state
  echo offline > /sys/block/sdb/device/state
  sleep 5
  mdadm -S /dev/md1

  echo running > /sys/block/sdb/device/state
  echo running > /sys/block/sdc/device/state
  mdadm -A /dev/md1 /dev/sd[a-c] --run --force

When we readd /dev/sda to the array, it started to do recovery.
After offline the other two disks in md1, the recovery have
been interrupted and superblock update info cannot be written
to the offline disks. While the spare disk (/dev/sda) can continue
to update superblock info.

After stopping the array and assemble it, we found the array
run fail, with the follow kernel message:

[  172.986064] md: kicking non-fresh sdb from array!
[  173.004210] md: kicking non-fresh sdc from array!
[  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[  173.022406] md1: failed to create bitmap (-5)
[  173.023466] md: md1 stopped.

Since both sdb and sdc have the value of 'sb->events' smaller than
that in sda, they have been kicked from the array. However, the only
remained disk sda is in 'spare' state before stop and it cannot be
added to conf->mirrors[] array. In the end, raid array assemble
and run fail.

In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().

To fix the problem, we do not compare superblock events when it is
a spare disk, as same as validate_super.
Signed-off-by: NYufen Yu <yuyufen@huawei.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

432bc300

md: return -ENODEV if rdev has no mddev assigned · d6d6b831

由 Pawel Baldysiak 提交于 3月 27, 2019

commit c42d3240990814eec1e4b2b93fa0487fc4873aed upstream.

Mdadm expects that setting drive as faulty will fail with -EBUSY only if
this operation will cause RAID to be failed. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.

If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

d6d6b831

md/raid10: Fix raid10 replace hang when new added disk faulty · 52c7cc01

由 Alex Wu 提交于 9月 21, 2018

commit ee37d7314a32ab6809eacc3389bad0406c69a81f upstream.

[Symptom]

Resync thread hang when new added disk faulty during replacing.

[Root Cause]

In raid10_sync_request(), we expect to issue a bio with callback
end_sync_read(), and a bio with callback end_sync_write().

In normal situation, we will add resyncing sectors into
mddev->recovery_active when raid10_sync_request() returned, and sub
resynced sectors from mddev->recovery_active when end_sync_write()
calls end_sync_request().

If new added disk, which are replacing the old disk, is set faulty,
there is a race condition:
    1. In the first rcu protected section, resync thread did not detect
       that mreplace is set faulty and pass the condition.
    2. In the second rcu protected section, mreplace is set faulty.
    3. But, resync thread will prepare the read object first, and then
       check the write condition.
    4. It will find that mreplace is set faulty and do not have to
       prepare write object.
This cause we add resync sectors but never sub it.

[How to Reproduce]

This issue can be easily reproduced by the following steps:
    mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
    mdadm /dev/md0 -a /dev/sde
    mdadm /dev/md0 --replace /dev/sdd
    sleep 1
    mdadm /dev/md0 -f /dev/sde

[How to Fix]

This issue can be fixed by using local variables to record the result
of test conditions. Once the conditions are satisfied, we can make sure
that we need to issue a bio for read and a bio for write.

Previous 'commit 24afd80d ("md/raid10: handle recovery of
replacement devices.")' will also check whether bio is NULL, but leave
the comment saying that it is a pointless test. So we remove this dummy
check.
Reported-by: NAlex Chen <alexchen@synology.com>
Reviewed-by: NAllen Peng <allenpeng@synology.com>
Reviewed-by: NBingJing Chang <bingjingc@synology.com>
Signed-off-by: NAlex Wu <alexwu@synology.com>
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

52c7cc01

alinux: mm, memcg: record latency of memcg wmark reclaim · 40969475

由 Xu Yu 提交于 12月 28, 2019

The memcg background async page reclaim, a.k.a, memcg kswapd, is
implemented with a dedicated unbound workqueue currently.

However, memcg kswapd will run too frequently, resulting in high
overhead, page cache thrashing, frequent dirty page writeback, etc., due
to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity,
or even abnormal memcg memory usage.

We need to find out the problematic memcg(s) where memcg kswapd
introduces significant overhead.

This records the latency of each run of memcg kswapd work, and then
aggregates into the exstat of per memcg.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

40969475

cpuidle: governor: Add new governors to cpuidle_governors again · 8e22a1fb

由 Rafael J. Wysocki 提交于 3月 12, 2019

commit 22782b3f9bb8ae21c710e2880db21bc729771e92 upstream

After commit 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command
line parameter") new cpuidle governors are not added to the list
of available governors, so governor selection via sysfs doesn't
work as expected (even though it is rarely used anyway).

Fix that by making cpuidle_register_governor() add new governors to
cpuidle_governors again.

Fixes: 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command line parameter")
Reported-by: NKees Cook <keescook@chromium.org>
Cc: 5.0+ <stable@vger.kernel.org> # 5.0+
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

8e22a1fb

kvm: x86: add host poll control msrs · 1e53713f

由 Marcelo Tosatti 提交于 6月 03, 2019

commit 2d5ba19bdfef4dd06add144eb04287ee98409f75 upstream

Add an MSRs which allows the guest to disable
host polling (specifically the cpuidle-haltpoll,
when performing polling in the guest, disables
host side polling).
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

1e53713f

KVM: arm64: Opportunistically turn off WFI trapping when using direct LPI injection · ff9b66b5

由 Marc Zyngier 提交于 11月 07, 2019

commit ef2e78ddadbb939ce79553b10dee0131d65d8f3e upstream.

Just like we do for WFE trapping, it can be useful to turn off
WFI trapping when the physical CPU is not oversubscribed (that
is, the vcpu is the only runnable process on this CPU) *and*
that we're using direct injection of interrupts.

The conditions are reevaluated on each vcpu_load(), ensuring that
we don't switch to this mode on a busy system.

On a GICv4 system, this has the effect of reducing the generation
of doorbell interrupts to zero when the right conditions are
met, which is a huge improvement over the current situation
(where the doorbells are screaming if the CPU ever hits a
blocking WFI).
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Reviewed-by: NZenghui Yu <yuzenghui@huawei.com>
Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
Link: https://lore.kernel.org/r/20191107160412.30301-3-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
Acked-by: NZou Cao <zoucao@linux.alibaba.com>

ff9b66b5

KVM: vgic-v4: Track the number of VLPIs per vcpu · 27840020

由 Marc Zyngier 提交于 11月 07, 2019

commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream.

In order to find out whether a vcpu is likely to be the target of
VLPIs (and to further optimize the way we deal with those), let's
track the number of VLPIs a vcpu can receive.

This gets implemented with an atomic variable that gets incremented
or decremented on map, unmap and move of a VLPI.
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Reviewed-by: NZenghui Yu <yuzenghui@huawei.com>
Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
Acked-by: NZou Cao <zoucao@linux.alibaba.com>

27840020

KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put · 42993070

由 Marc Zyngier 提交于 10月 27, 2019

commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream.

When the VHE code was reworked, a lot of the vgic stuff was moved around,
but the GICv4 residency code did stay untouched, meaning that we come
in and out of residency on each flush/sync, which is obviously suboptimal.

To address this, let's move things around a bit:

- Residency entry (flush) moves to vcpu_load
- Residency exit (sync) moves to vcpu_put
- On blocking (entry to WFI), we "put"
- On unblocking (exit from WFI), we "load"

Because these can nest (load/block/put/load/unblock/put, for example),
we now have per-VPE tracking of the residency state.

Additionally, vgic_v4_put gains a "need doorbell" parameter, which only
gets set to true when blocking because of a WFI. This allows a finer
control of the doorbell, which now also gets disabled as soon as
it gets signaled.
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
Acked-by: NZou Cao <zoucao@linux.alibaba.com>

42993070

EDAC, skx: Retrieve and print retry_rd_err_log registers · a20889d9

由 Tony Luck 提交于 8月 15, 2019

commit e80634a75aba90e7485cd1fdb463fcac5d45f14d upstream

Skylake logs some additional useful information in per-channel
registers in addition the the architectural status/addr/misc
logged in the machine check bank.

Pick up this information and add it to the EDAC log:

retry_rd_err_[five 32-bit register values]

Sorry, no definitions for these registers. OEMs and DIMM vendors
will be able to use them to isolate which cells in the DIMM are
causing problems.

correrrcnt[per rank corrected error counts]

Note that if additional errors are logged while these registers are
being read, you may see a jumble of values some from earlier errors,
others from later errors (since the registers report the most recent
logged error). The correrrcnt registers provide error counts per possible
rank. If these counts only change by one since the previous error logged
for this channel, then it is safe to assume that the registers logged
provide a coherent view of one error.

With this change EDAC logs look like this:

EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])
Acked-by: NAristeu Rozanski <aris@redhat.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: Nzhaobing <zhaobing@linux.alibaba.com>
Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>

a20889d9

tools headers uapi: Sync asm-generic/mman-common.h with the kernel · 5c1675fc

由 Arnaldo Carvalho de Melo 提交于 9月 27, 2019

commit b1ba55cf1cfb9f3e0e00d743534684a25bf66d28 upstream

To pick the changes from:

  1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
  9c276cc65a58 ("mm: introduce MADV_COLD")

That result in these changes in the tools:

  $ tools/perf/trace/beauty/madvise_behavior.sh > before
  $ cp include/uapi/asm-generic/mman-common.h tools/include/uapi/asm-generic/mman-common.h
  $ git diff
  diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
  index 63b1f506ea67..c160a5354eb6 100644
  --- a/tools/include/uapi/asm-generic/mman-common.h
  +++ b/tools/include/uapi/asm-generic/mman-common.h
  @@ -67,6 +67,9 @@
   #define MADV_WIPEONFORK 18             /* Zero memory on fork, child only */
   #define MADV_KEEPONFORK 19             /* Undo MADV_WIPEONFORK */

  +#define MADV_COLD      20              /* deactivate these pages */
  +#define MADV_PAGEOUT   21              /* reclaim these pages */
  +
   /* compatibility flags */
   #define MAP_FILE       0

  $ tools/perf/trace/beauty/madvise_behavior.sh > after
  $ diff -u before after
  --- before	2019-09-27 11:29:43.346320100 -0300
  +++ after	2019-09-27 11:30:03.838570439 -0300
  @@ -16,6 +16,8 @@
   	[17] = "DODUMP",
   	[18] = "WIPEONFORK",
   	[19] = "KEEPONFORK",
  +	[20] = "COLD",
  +	[21] = "PAGEOUT",
   	[100] = "HWPOISON",
   	[101] = "SOFT_OFFLINE",
   };
  $

I.e. now when madvise gets those behaviours as args, it will be able to
translate from the number to a human readable string.

This addresses the following perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/mman-common.h' differs from latest version at 'include/uapi/asm-generic/mman-common.h'
  diff -u tools/include/uapi/asm-generic/mman-common.h include/uapi/asm-generic/mman-common.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/n/tip-n40y6c4sa49p29q6sl8w3ufx@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

5c1675fc

mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 88ec97ec

由 zhong jiang 提交于 11月 15, 2019

commit 82072962973008201b817fae1896512977dd5083 upstream

Recently, I hit the following issue when running upstream.

  kernel BUG at mm/vmscan.c:1521!
  invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
  RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
  Call Trace:
   reclaim_pages+0x499/0x800 mm/vmscan.c:2188
   madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
   walk_pmd_range mm/pagewalk.c:53 [inline]
   walk_pud_range mm/pagewalk.c:112 [inline]
   walk_p4d_range mm/pagewalk.c:139 [inline]
   walk_pgd_range mm/pagewalk.c:166 [inline]
   __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
   walk_page_range+0x179/0x310 mm/pagewalk.c:349
   madvise_pageout_page_range mm/madvise.c:506 [inline]
   madvise_pageout+0x1f0/0x330 mm/madvise.c:542
   madvise_vma mm/madvise.c:931 [inline]
   __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
   do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout() accesses the specified range of the vma and isolates
them, then runs shrink_page_list() to reclaim its memory.  But it also
isolates the unevictable pages to reclaim.  Hence, we can catch the
cases in shrink_page_list().

The root cause is that we scan the page tables instead of specific LRU
list.  and so we need to filter out the unevictable lru pages from our
end.

Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Signed-off-by: Nzhong jiang <zhongjiang@huawei.com>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

88ec97ec

mm: factor out common parts between MADV_COLD and MADV_PAGEOUT · b6a18a3c

由 Minchan Kim 提交于 9月 25, 2019

commit d616d5126503967bf365db0711ee3c78b356efe9 upstream

There are many common parts between MADV_COLD and MADV_PAGEOUT.
This patch factor them out to save code duplication.

Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

b6a18a3c

mm: introduce MADV_PAGEOUT · 23757dcc

由 Minchan Kim 提交于 9月 25, 2019

commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream

When a process expects no accesses to a certain memory range for a long
time, it could hint kernel that the pages can be reclaimed instantly but
data should be preserved for future use.  This could reduce workingset
eviction so it ends up increasing performance.

This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
MADV_PAGEOUT can be used by a process to mark a memory range as not
expected to be used for a long time so that kernel reclaims *any LRU*
pages instantly.  The hint can help kernel in deciding which pages to
evict proactively.

A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
intentionally because it's automatically bounded by PMD size.  If PMD
size(e.g., 256) makes some trouble, we could fix it later by limit it to
SWAP_CLUSTER_MAX[1].

- man-page material

MADV_PAGEOUT (since Linux x.x)

Do not expect access in the near future so pages in the specified
regions could be reclaimed instantly regardless of memory pressure.
Thus, access in the range after successful operation could cause
major page fault but never lose the up-to-date contents unlike
MADV_DONTNEED. Pages belonging to a shared mapping are only processed
if a write access is allowed for the calling process.

MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages.

[1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/

[minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
  Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

23757dcc

mm: introduce MADV_COLD · 1af766e8

由 Minchan Kim 提交于 9月 25, 2019

commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream

Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start.  While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.

To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon).  They are likely to be killed by
lmkd if the system has to reclaim memory.  In that sense they are similar
to entries in any other cache.  Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap.  Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state.  Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.

To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly.  These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space.  MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.

This patch (of 5):

When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future.  The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

	active file page -> inactive file LRU
	active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because the
content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault.  Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages.  It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied.  Even, it could give a bonus to make
them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost.  Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU.  Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device.  Let's start simpler way without adding
complexity at this moment.  However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.  In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.

[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

1af766e8

mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · a0747c91

由 Minchan Kim 提交于 9月 25, 2019

commit 8940b34a4e082ae11498ddae8432f2ac07685d1c upstream

The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default.  It is for preventing to reclaim dirty pages when CMA try to
migrate pages.  Strictly speaking, we don't need it because CMA didn't
allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.

Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch.  So
this patch makes references's default value to PAGEREF_RECLAIM and rename
force_reclaim with ignore_references to make it more clear.

This is a preparatory work for next patch.

Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

a0747c91

tools build: Check if gettid() is available before providing helper · 27a374d1

由 Arnaldo Carvalho de Melo 提交于 3月 05, 2020

commit 4541a8bb13a86e504416a13360c8dc64d2fd612a upstream

Laura reported that the perf build failed in fedora when we got a glibc
that provides gettid(), which I reproduced using fedora rawhide with the
glibc-devel-2.29.9000-26.fc31.x86_64 package.

Add a feature check to avoid providing a gettid() helper in such
systems.

On a fedora rawhide system with this patch applied we now get:

  [root@7a5f55352234 perf]# grep gettid /tmp/build/perf/FEATURE-DUMP
  feature-gettid=1
  [root@7a5f55352234 perf]# cat /tmp/build/perf/feature/test-gettid.make.output
  [root@7a5f55352234 perf]# ldd /tmp/build/perf/feature/test-gettid.bin
          linux-vdso.so.1 (0x00007ffc6b1f6000)
          libc.so.6 => /lib64/libc.so.6 (0x00007f04e0a74000)
          /lib64/ld-linux-x86-64.so.2 (0x00007f04e0c47000)
  [root@7a5f55352234 perf]# nm /tmp/build/perf/feature/test-gettid.bin | grep -w gettid
                   U gettid@@GLIBC_2.30
  [root@7a5f55352234 perf]#

While on a fedora:29 system:

  [acme@quaco perf]$ grep gettid /tmp/build/perf/FEATURE-DUMP
  feature-gettid=0
  [acme@quaco perf]$ cat /tmp/build/perf/feature/test-gettid.make.output
  test-gettid.c: In function ‘main’:
  test-gettid.c:8:9: error: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
    return gettid();
           ^~~~~~
           getgid
  cc1: all warnings being treated as errors
  [acme@quaco perf]$
Reported-by: NLaura Abbott <labbott@redhat.com>
Tested-by: NLaura Abbott <labbott@redhat.com>
Acked-by: NJiri Olsa &lt;jolsa@kernel.org&gt;Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lkml.kernel.org/n/tip-yfy3ch53agmklwu9o7rlgf9c@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Nluanshi <zhangliguang@linux.alibaba.com>
Reviewed-by: NAlex Shi <alex.shi@linux.alibaba.com>

27a374d1

alinux: mm: add proc interface to control context readahead · 2e38a0f2

由 Xiaoguang Wang 提交于 3月 09, 2020

For some workloads whose io activities are mostly random, context
readahead feature can introduce unnecessary io read operations, which
will impact app's performance. Context readahead's algorithm is
straightforward and not that smart.

This patch adds "/proc/sys/vm/enable_context_readahead" to control
whether to disable or enable this feature. Currently we enable context
readahead default, user can echo 0 to /proc/sys/vm/enable_context_readahead
to disable context readahead.

We also have tested mongodb's performance in 'random point select' case,
With context readahead enabled:
  mongodb eps 12409
With context readahead disabled:
  mongodb eps 14443
About 16% performance improvement.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

2e38a0f2

alinux: hookers: add arm64 dependency · db153a10

由 Zou Cao 提交于 12月 13, 2019

Now hookers has been support in arm64, add the Kconfig depend for
arm64, otherwise it won't be built.
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>

db153a10

alinux: Hookers: add arm64 support · bc56b7b3

由 Zou Cao 提交于 12月 01, 2019

arm64 has't global write protect set, here we remove the write
proctect in pmd table and update the hook data.
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>

bc56b7b3

alinux: arm64: use __kernel_text_address to replace kthread_return_to_user · 64259ab4

由 Zou Cao 提交于 3月 06, 2020

We don't need to use kthread_return_to_user to tell unwind it is kernel
thread, we can use __kernel_text_address, it is a normal way in other
arch like x86/ppc.
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

64259ab4

arm64: reliable stacktraces · 46ad7da7

由 Torsten Duwe 提交于 3月 06, 2020

cherry-picked from: https://patchwork.kernel.org/patch/10657429/

Enhance the stack unwinder so that it reports whether it had to stop
normally or due to an error condition; unwind_frame() will report
continue/error/normal ending and walk_stackframe() will pass that
info. __save_stack_trace() is used to check the validity of a stack;
save_stack_trace_tsk_reliable() can now trivially be implemented.
Modify arch/arm64/kernel/time.c as the only external caller so far
to recognise the new semantics.

I had to introduce a marker symbol kthread_return_to_user to tell
the normal origin of a kernel thread.
Signed-off-by: NTorsten Duwe <duwe@suse.de>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

46ad7da7

alinux: arm64: add livepatch support · 7d9b185c

由 Zou Cao 提交于 3月 06, 2020

Now we support FTRACE_WITH_REGS with -fpatchable-function-entry, here
enable the livepatch support depend on FTRACE_WITH_REGS.

Use task flag bit 6 to track patch transisiton state for the consistency
model. Add it to the work mask so it gets cleared on all kernel exits to
userland.

Tell livepatch regs->pc + 2*AARCH64_INSN_SIZE is the place to change the
return address.

these codes have a big change against reference link, beacause we use new
gcc featrue.

References:
https://patchwork.kernel.org/patch/10657431/

Based-on-code-from: Torsten Duwe <duwe@suse.de>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

7d9b185c

efi: Make efi_rts_work accessible to efi page fault handler · 95fc4624

由 Sai Praneeth 提交于 9月 11, 2018

[ Upstream commit 9dbbedaa6171247c4c7c40b83f05b200a117c2e0 ]

After the kernel has booted, if any accesses by firmware causes a page
fault, the efi page fault handler would freeze efi_rts_wq and schedules
a new process. To do this, the efi page fault handler needs
efi_rts_work. Hence, make it accessible.

There will be no race conditions in accessing this structure, because
all the calls to efi runtime services are already serialized.

[ Wen: This patch also fixes a memory corruption:
       #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)\
       ({                                                             \
        struct efi_runtime_work efi_rts_work;                           \
       …
        init_completion(&efi_rts_work.efi_rts_comp);                    \
        INIT_WORK(&efi_rts_work.work, efi_call_rts);                    \
       …

       efi_rts_work is on the stack, registering it to workqueue will cause
       the following error:

       ODEBUG: object (____ptrval____) is on stack (____ptrval____),
       but NOT annotated.
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 1 at lib/debugobjects.c:368
       __debug_object_init+0x218/0x538
       Modules linked in:
       CPU: 6 PID: 1 Comm: swapper/0 Tainted: G        W         4.19.91 #19
       …
       Call trace:
       __debug_object_init+0x218/0x538
       debug_object_init+0x20/0x28
       __init_work+0x34/0x58
       virt_efi_get_time.part.5+0x6c/0x12c
       virt_efi_get_time+0x4c/0x58
       efi_read_time+0x40/0x9c
       __rtc_read_time+0x50/0x118
       rtc_read_time+0x60/0x1f0
       rtc_hctosys+0x74/0x124
       do_one_initcall+0xac/0x3d4
       kernel_init_freeable+0x49c/0x59c
       kernel_init+0x18/0x110 ]
Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
Suggested-by: NMatt Fleming <matt@codeblueprint.co.uk>
Based-on-code-from: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
Fixes: 3eb420e7 ("efi: Use a work queue to invoke EFI Runtime Services")
Signed-off-by: NWen Yang <wenyang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

95fc4624

netfilter: conntrack: udp: set stream timeout to 2 minutes · 0127f777

由 Florian Westphal 提交于 12月 18, 2018

commit 294304e4c522d797b7ea8200ab74354843fa68e9 upstream

We have no explicit signal when a UDP stream has terminated, peers just
stop sending.

For suspected stream connections a timeout of two minutes is sane to keep
NAT mapping alive a while longer.

It matches tcp conntracks 'timewait' default timeout value.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
Acked-by: NDust Li <dust.li@linux.alibaba.com>

0127f777

netfilter: conntrack: udp: only extend timeout to stream mode after 2s · 617c7456

由 Florian Westphal 提交于 12月 06, 2018

commit d535c8a69c1924e70186d80be0a9cecaf475f166 upstream

Currently DNS resolvers that send both A and AAAA queries from same source port
can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
for three minutes, even though DNS requests are done in a few milliseconds.

Add a two second grace period where we continue to use the ordinary
30-second default timeout. Its enough for DNS request/response traffic,
even if two request/reply packets are involved.

ASSURED is still set, else conntrack (and thus a possible
NAT mapping ...) gets zapped too in case conntrack table runs full.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
Acked-by: NDust Li <dust.li@linux.alibaba.com>

617c7456

iomap: Allow forcing of waiting for running DIO in iomap_dio_rw() · 96d39291

由 Jan Kara 提交于 10月 15, 2019

commit 13ef954445df4fd1d7c003a500ec5ce49573e14b upstream

Notes from Xiaoguang Wang:
    Indeed this patch should be appled before "ext4: introduce direct I/O
read using iomap infrastructure", but given that we have already appled
"ext4: introduce direct I/O read using iomap infrastructure" previously,
we need to update iomap_dio_rw() calls with the new argument  in ext4.

Filesystems do not support doing IO as asynchronous in some cases. For
example in case of unaligned writes or in case file size needs to be
extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO
in such cases, add argument to iomap_dio_rw() which makes the function
wait for IO completion. This also results in executing
iomap_dio_complete() inline in iomap_dio_rw() providing its return value
to the caller as for ordinary sync IO.
Signed-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

96d39291

io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · a3a1829e

由 Xiaoguang Wang 提交于 2月 26, 2020

commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 upstream

After making ext4 support iopoll method:
  let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
we found fio can easily hang in fio_ioring_getevents() with below fio
job:
    rm -f testfile; sync;
    sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
-rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
-bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.

There are two issues that results in this hang, one reason is that
when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
does not use io_uring_enter to get completed events, it relies on
kernel io_sq_thread to poll for completed events.

Another reason is that there is a race: when io_submit_sqes() in
io_sq_thread() submits a batch of sqes, variable 'inflight' will
record the number of submitted reqs, then io_sq_thread will poll for
reqs which have been added to poll_list. But note, if some previous
reqs have been punted to io worker, these reqs will won't be in
poll_list timely. io_sq_thread() will only poll for a part of previous
submitted reqs, and then find poll_list is empty, reset variable
'inflight' to be zero. If app just waits these deferred reqs and does
not wake up io_sq_thread again, then hang happens.

For app that entirely relies on io_sq_thread to poll completed requests,
let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
element to poll_list, and when io_sq_thread prepares to sleep, check
whether poll_list is empty again, if not empty, continue to poll.
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a3a1829e

cpuidle: haltpoll: Take 'idle=' override into account · 0e2f72bd

由 Zhenzhong Duan 提交于 10月 18, 2019

commit be721b9b24394c088326384d44dcb9ec4b321aae upstream

Currenly haltpoll isn't aware of the 'idle=' override, the priority is
'idle=poll' > haltpoll > 'idle=halt'. When 'idle=poll' is used, cpuidle
driver is bypassed but current_driver in sys still shows 'haltpoll'.

When 'idle=halt' is used, haltpoll takes precedence and makes
'idle=halt' have no effect.

Add a check to prevent the haltpoll driver from loading if 'idle=' is
present.
Signed-off-by: NZhenzhong Duan <zhenzhong.duan@oracle.com>
Co-developed-by: NJoao Martins <joao.m.martins@oracle.com>
[ rjw: Subject ]
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

0e2f72bd

cpuidle-haltpoll: do not set an owner to allow modunload · 6805e4ba

由 Joao Martins 提交于 9月 08, 2019

commit b0513562cf8c97035a60b6463b459edf8b3c9e5d upstream

cpuidle-haltpoll can be built as a module to allow optional late load.
Given we are setting @owner to THIS_MODULE, cpuidle will attempt to grab a
module reference every time a cpuidle_device is registered -- so
essentially all online cpus get a reference.

This prevents for the module to be unloaded later, which makes the
module_exit callback entirely unused. Thus remove the @owner and allow
module to be unloaded.

Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

6805e4ba

cpuidle-haltpoll: return -ENODEV on modinit failure · f9706f22

由 Joao Martins 提交于 9月 08, 2019

commit 26ee75559bd10f07b5b99e40e05df5f0d7d7b7ec upstream

When a user loads cpuidle-haltpoll on a non KVM guest the module will
successfully load, even though idle driver registration didn't take
place.

We should instead return -ENODEV signaling the user that the driver can't
be loaded, like other error paths in haltpoll_init().  An example of such
error paths is when we return -EBUSY when attempting to register an idle
driver when it had one already (e.g. intel_idle loads at boot and then we
attempt to insert module cpuidle-haltpoll).

Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

f9706f22

cpuidle-haltpoll: set haltpoll as preferred governor · aee4b74e

由 Joao Martins 提交于 9月 08, 2019

commit c194c2ad74696ce686a2ee1239f44d9750cb5e5c upstream

Right now, guest current governors have the following ratings:

 * ladder            -> 10
 * teo               -> 19
 * menu              -> 20
 * haltpoll          -> 21
 * ladder + nohz=off -> 25

haltpoll governor got introduced and it is now the default governor given
its highest rating -- with ladder+nohz being the exception -- regardless of
idle driver in the guest. An example of an undesirable case is x86 KVM
guests with MWAIT which have intel_idle registered first, and consequently
will have haltpoll be used as governor which would get limited to a poll
state and state 1 and the other states wouldn't get used.

To keep the previous defaults we decrease rating of governor to 9 (below
current lowest rating) and thus rely on @governor switch on
cpuidle_register_driver() to tie in haltpoll idle driver and governor
together.
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

aee4b74e

cpuidle: allow governor switch on cpuidle_register_driver() · 0176005e

由 Joao Martins 提交于 9月 08, 2019

commit 11c59eae6633b8a7e77b8ee1cf908964d80c78cd upstream

The recently introduced haltpoll driver is largely only useful with
haltpoll governor. To allow drivers to associate with a particular idle
behaviour, add a @governor property to 'struct cpuidle_driver' and thus
allow a cpuidle driver to switch to a *preferred* governor on idle driver
registration. We save the previous governor, and when an idle driver is
unregistered we switch back to that.

The @governor can be overridden by cpuidle.governor= boot param or
alternatively be ignored if the governor doesn't exist.
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

0176005e

cpuidle-haltpoll: vcpu hotplug support · 1e18beca

由 Marcelo Tosatti 提交于 7月 03, 2019

commit 97d3eb9da84cae0548359b0aecb8619faad003b7 upstream

When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
past the online ones and thus fail to register the idle driver.
This is because cpuidle_add_sysfs() will return with -ENODEV as a
consequence from get_cpu_device() return no device for a non-existing
CPU.

Instead switch to cpuidle_register_driver() and manually register each
of the present cpus through cpuhp_setup_state() callbacks and future
ones that get onlined or offlined. This mimmics similar logic that
intel_idle does.

Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

1e18beca

cpuidle: add haltpoll governor · 7b905bca

由 Marcelo Tosatti 提交于 7月 03, 2019

commit 2cffe9f6b96fece065ee8522673c90e92ef2085d upstream

The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle
driver, allows guest vcpus to poll for a specified amount of time before
halting.
This provides the following benefits to host side polling:

        1) The POLL flag is set while polling is performed, which allows
           a remote vCPU to avoid sending an IPI (and the associated
           cost of handling the IPI) when performing a wakeup.

        2) The VM-exit cost can be avoided.

The downside of guest side polling is that polling is performed
even with other runnable tasks in the host.

Results comparing halt_poll_ns and server/client application
where a small packet is ping-ponged:

host                                        --> 31.33
halt_poll_ns=300000 / no guest busy spin    --> 33.40   (93.8%)
halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73   (95.7%)

For the SAP HANA benchmarks (where idle_spin is a parameter
of the previous version of the patch, results should be the
same):

hpns == halt_poll_ns

                          idle_spin=0/   idle_spin=800/    idle_spin=0/
                          hpns=200000    hpns=0            hpns=800000
DeleteC06T03 (100 thread) 1.76           1.71 (-3%)        1.78   (+1%)
InsertC16T02 (100 thread) 2.14           2.07 (-3%)        2.18   (+1.8%)
DeleteC00T01 (1 thread)   1.34           1.28 (-4.5%)      1.29   (-3.7%)
UpdateC00T03 (1 thread)   4.72           4.18 (-12%)       4.53   (-5%)
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

7b905bca

add cpuidle-haltpoll driver · 6d2ef95f

由 Marcelo Tosatti 提交于 7月 03, 2019

commit fa86ee90eb1111267de67cb4272b5ce711f18cbb upstream

Add a cpuidle driver that calls the architecture default_idle routine.

To be used in conjunction with the haltpoll governor.
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

6d2ef95f

governors: unify last_state_idx · dc59f8b0

由 Marcelo Tosatti 提交于 7月 03, 2019

commit 7d4daeedd575bbc3c40c87fc6708a8b88c50fe7e upstream

Since this field is shared by all governors, move it to
cpuidle device structure.
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

dc59f8b0

cpuidle: menu: Do not update last_state_idx in menu_select() · 5e083c5c

由 Rafael J. Wysocki 提交于 10月 02, 2018

commit eb40a380bff28f84b6583bba6786b46ef26ef548 upstream

It is not necessary to update data->last_state_idx in menu_select()
as it only is used in menu_update() which only runs when
data->needs_update is set and that is set only when updating
data->last_state_idx in menu_reflect().

Accordingly, drop the update of data->last_state_idx from
menu_select() and get rid of the (now redundant) "out" label
from it.

No intentional behavior changes.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

5e083c5c

cpuidle: add poll_limit_ns to cpuidle_device structure · 2163e221

由 Marcelo Tosatti 提交于 7月 03, 2019

commit 259231a045616c4101d023a8f4dcc8379af265a6 upstream

Add a poll_limit_ns variable to cpuidle_device structure.

Calculate and configure it in the new cpuidle_poll_time
function, in case its zero.

Individual governors are allowed to override this value.
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

2163e221

cpuidle: poll_state: Fix default time limit · a72c57d7

由 Doug Smythies 提交于 1月 30, 2019

commit 1617971c6616c87185cbc78fa1a86dfc70dd16b6 upstream

The default time is declared in units of microsecnds,
but is used as nanoseconds, resulting in significant
accounting errors for idle state 0 time when all idle
states deeper than 0 are disabled.

Under these unusual conditions, we don't really care
about the poll time limit anyhow.

Fixes: 800fb34a99ce ("cpuidle: poll_state: Disregard disable idle states")
Signed-off-by: NDoug Smythies <dsmythies@telus.net>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

a72c57d7

openanolis / cloud-kernel 10 个月 前同步成功

openanolis / cloud-kernel
10 个月前同步成功