summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2015-10-27x86/ioapic: Prevent NULL pointer dereference in setup_ioapic_dest()Werner Pawlitschko
Commit 4857c91f0d19 changed the way how irq affinity is setup in setup_ioapic_dest() from using the core helper function to unconditionally calling the irq_set_affinity() callback of the underlying irq chip. That results in a NULL pointer dereference for the rare case where the underlying irq chip is lapic_chip which has no irq_set_affinity() callback. lapic_chip is occasionally used for the timer interrupt (irq 0). The fix is simple: Check the availability of the callback instead of calling it unconditionally. Fixes: 4857c91f0d19 "x86/ioapic: Force affinity setting in setup_ioapic_dest()" Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2015-10-26x86/dma-mapping: Fix arch_dma_alloc_attrs() oops with NULL devVille Syrjälä
Commit 6894258eda2f broke drivers that pass NULL as the device pointer to dma_alloc. The reason is that arch_dma_alloc_attrs() now calls dma_alloc_coherent_gfp_flags() which in turn calls dma_alloc_coherent_mask(), where the device pointer is dereferenced unconditionally. Fix things by moving the ISA DMA fallback device assignment before the call to dma_alloc_coherent_gfp_flags(). Fixes: 6894258eda2f ("dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}") Reported-and-tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/1445807503-8920-1-git-send-email-ville.syrjala@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-25Merge branches 'acpi-scan', 'acpi-tables', 'acpi-ec' and 'acpi-assorted'Rafael J. Wysocki
* acpi-scan: ACPI / scan: use kstrdup_const() in acpi_add_id() ACPI / scan: constify struct acpi_hardware_id::id ACPI / scan: constify first argument of struct acpi_scan_handler::match * acpi-tables: ACPI / tables: test the correct variable x86, ACPI: Handle apic/x2apic entries in MADT in correct order ACPI / tables: Add acpi_subtable_proc to ACPI table parsers * acpi-ec: ACPI / EC: Fix a race issue in acpi_ec_guard_event() ACPI / EC: Fix query handler related issues * acpi-assorted: ACPI: change acpi_sleep_proc_init() to return void ACPI: change init_acpi_device_notify() to return void
2015-10-22ftrace: Calculate the correct dyn_ftrace number to report to the userspaceMinfei Huang
Now, ftrace only calculate the dyn_ftrace number in the adding breakpoint loop, not in adding update and finish update loop. Calculate the correct dyn_ftrace, once ftrace reports the failure message to the userspace. Link: http://lkml.kernel.org/r/1442420382-13130-1-git-send-email-mnfhuang@gmail.com Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-21x86/microcode/intel: Move #ifdef DEBUG inside the functionBorislav Petkov
... and save us the stub. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1445334889-300-6-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/microcode/amd: Remove maintainers from commentsBorislav Petkov
We have the MAINTAINERS file for that. Also, Andreas doesn't have the time for this work anymore. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andreas Herrmann <herrmann.der.user@googlemail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1445334889-300-5-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/microcode: Remove modularization leftoversBorislav Petkov
Remove the remaining module functionality leftovers. Make "dis_ucode_ldr" an early_param and make it static again. Drop module aliases, autoloading table, description, etc. Bump version number, while at it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1445334889-300-4-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/microcode: Merge the early microcode loaderBorislav Petkov
Merge the early loader functionality into the driver proper. The diff is huge but logically, it is simply moving code from the _early.c files into the main driver. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1445334889-300-3-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/microcode: Unmodularize the microcode driverBorislav Petkov
Make CONFIG_MICROCODE a bool. It was practically a bool already anyway, since early loader was forcing it to =y. Regardless, there's no real reason to have something be a module which gets built-in on the majority of installations out there. And its not like there's noticeable change in functionality - we still can load late microcode - just the module glue disappears. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1445334889-300-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21timers/x86/hpet: Type adjustmentsJan Beulich
Standardize on bool instead of an inconsistent mixture of u8 and plain 'int'. Also use u32 or 'unsigned int' instead of 'unsigned long' when a 32-bit type suffices, generating slightly better code on x86-64. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/5624E3A002000078000AC49A@prv-mh.provo.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/mce: Fix thermal throttling reporting after kexecAndi Kleen
The per CPU thermal vector init code checks if the thermal vector is already installed and complains and bails out if it is. This happens after kexec, as kernel shut down does not clear the thermal vector APIC register. This causes two problems: 1. So we always do not fully initialize thermal reports after kexec. The CPU is still likely initialized, as the previous kernel should have done it. But we don't set up the software pointer to the thermal vector, so reporting may end up with a unknown thermal interrupt message. 2. Also it complains for every logical CPU, even though the value is actually derived from BP only. The problem is that we end up with one message per CPU, so on larger systems it becomes very noisy and messes up the otherwise nicely formatted CPU bootup numbers in the kernel log. Just remove the check. I checked the code and there's no valid code paths where the thermal init code for a CPU could be called multiple times. Why the kernel does not clean up this value on shutdown: The thermal monitoring is controlled per logical CPU thread. Normal shutdown code is just running on one CPU. To disable it we would need a broadcast NMI to all CPUs on shut down. That's overkill for this. So we just ignore it after kexec. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1445246268-26285-9-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/setup/crash: Check memblock_reserve() retvalBorislav Petkov
memblock_reserve() can fail but the crashkernel reservation code doesn't check that and this can lead the user into believing that the crashkernel region was actually reserved. Make sure we check that return value and we exit early with a failure message in the error case. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-7-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/setup/crash: Cleanup some moreBorislav Petkov
* Remove unused auto_set variable * Cleanup local function variable declarations * Reformat printk string and use pr_info() No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-6-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/setup/crash: Remove alignment variableBorislav Petkov
Use a macro instead. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-5-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/setup: Cleanup crashkernel reservation functionsBorislav Petkov
* Shorten variable names * Realign code, space out for better readability No code changed: # arch/x86/kernel/setup.o: text data bss dec hex filename 4543 3096 69904 77543 12ee7 setup.o.before 4543 3096 69904 77543 12ee7 setup.o.after md5: 8a1b7c6738a553ca207b56bd84a8f359 setup.o.before.asm 8a1b7c6738a553ca207b56bd84a8f359 setup.o.after.asm Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Young <dyoung@redhat.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Link: http://lkml.kernel.org/r/1445246268-26285-4-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21x86/setup: Do not reserve crashkernel high memory if low reservation failedBaoquan He
People reported that when allocating crashkernel memory using the ",high" and ",low" syntax, there were cases where the reservation of the high portion succeeds but the reservation of the low portion fails. Then kexec can load the kdump kernel successfully, but booting the kdump kernel fails as there's no low memory. The low memory allocation for the kdump kernel can fail on large systems for a couple of reasons. For example, the manually specified crashkernel low memory can be too large and thus no adequate memblock region would be found. Therefore, we try to reserve low memory for the crash kernel *after* the high memory portion has been allocated. If that fails, we free crashkernel high memory too and return. The user can then take measures accordingly. Tested-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Baoquan He <bhe@redhat.com> [ Massage text. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Joerg Roedel <jroedel@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Young <dyoung@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Chao <chaowang@redhat.com> Cc: jerry_hoemann@hp.com Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/1445246268-26285-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20x86/mm, kasan: Silence KASAN warnings in get_wchan()Andrey Ryabinin
get_wchan() is racy by design, it may access volatile stack of running task, thus it may access redzone in a stack frame and cause KASAN to warn about this. Use READ_ONCE_NOCHECK() to silence these warnings. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Cc: kasan-dev <kasan-dev@googlegroups.com> Link: http://lkml.kernel.org/r/1445243838-17763-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20perf/x86: Add support for PERF_SAMPLE_BRANCH_CALLStephane Eranian
This patch enables the suport for the PERF_SAMPLE_BRANCH_CALL for Intel x86 processors. When the processor support LBR filtering this the selection is done in hardware. Otherwise, the filter is applied by software. Note that we chose to include zero length calls because they also represent calls. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: khandual@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1444720151-10275-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20perf/x86: Fix time_shift in perf_event_mmap_pageAdrian Hunter
Commit: b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock") allowed the time_shift value in perf_event_mmap_page to be as much as 32. Unfortunately the documented algorithms for using time_shift have it shifting an integer, whereas to work correctly with the value 32, the type must be u64. In the case of perf tools, Intel PT decodes correctly but the timestamps that are output (for example by perf script) have lost 32-bits of granularity so they look like they are not changing at all. Fix by limiting the shift to 31 and adjusting the multiplier accordingly. Also update the documentation of perf_event_mmap_page so that new code based on it will be more future-proof. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock") Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19sched/x86: Fix typo in __switch_to() commentsChuck Ebbert
Fix obvious mistake: FS/GS should be DS/ES. Signed-off-by: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151014143119.78858eeb@r5 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19x86/smpboot: Fix CPU #1 boot timeoutLen Brown
The following commit: a9bcaa02a5104ac ("x86/smpboot: Remove SIPI delays from cpu_up()") Caused some Intel Core2 processors to time-out when bringing up CPU #1, resulting in the missing of that CPU after bootup. That patch reduced the SIPI delays from udelay() 300, 200 to udelay() 0, 0 on modern processors. Several Intel(R) Core(TM)2 systems failed to bring up CPU #1 10/10 times after that change. Increasing either of the SIPI delays to udelay(1) results in success. So here we increase both to udelay(10). While this may be 20x slower than the absolute minimum, it is still 20x to 30x faster than the original code. Tested-by: Donald Parsons <dparsons@brightdsl.net> Tested-by: Shane <shrybman@teksavvy.com> Signed-off-by: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dparsons@brightdsl.net Cc: shrybman@teksavvy.com Link: http://lkml.kernel.org/r/6dd554ee8945984d85aafb2ad35793174d068af0.1444968087.git.len.brown@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehaviorLen Brown
For legacy machines cpu_init_udelay defaults to 10,000. For modern machines it is set to 0. The user should be able to set cpu_init_udelay to any value on the cmdline, including 10,000. Before this patch, that was seen as "unchanged from default" and thus on a modern machine, the user request was ignored and the delay was set to 0. Signed-off-by: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dparsons@brightdsl.net Cc: shrybman@teksavvy.com Link: http://lkml.kernel.org/r/de363cdbbcfcca1d22569683f7eb9873e0177251.1444968087.git.len.brown@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-16x86/ioapic: Disable interrupts when re-routing legacy IRQsVitaly Kuznetsov
A sporadic hang with consequent crash is observed when booting Hyper-V Gen1 guests: Call Trace: <IRQ> [<ffffffff810ab68d>] ? trace_hardirqs_off+0xd/0x10 [<ffffffff8107b616>] queue_work_on+0x46/0x90 [<ffffffff81365696>] ? add_interrupt_randomness+0x176/0x1d0 ... <EOI> [<ffffffff81471ddb>] ? _raw_spin_unlock_irqrestore+0x3b/0x60 [<ffffffff810c295e>] __irq_put_desc_unlock+0x1e/0x40 [<ffffffff810c5c35>] irq_modify_status+0xb5/0xd0 [<ffffffff8104adbb>] mp_register_handler+0x4b/0x70 [<ffffffff8104c55a>] mp_irqdomain_alloc+0x1ea/0x2a0 [<ffffffff810c7f10>] irq_domain_alloc_irqs_recursive+0x40/0xa0 [<ffffffff810c860c>] __irq_domain_alloc_irqs+0x13c/0x2b0 [<ffffffff8104b070>] alloc_isa_irq_from_domain.isra.1+0xc0/0xe0 [<ffffffff8104bfa5>] mp_map_pin_to_irq+0x165/0x2d0 [<ffffffff8104c157>] pin_2_irq+0x47/0x80 [<ffffffff81744253>] setup_IO_APIC+0xfe/0x802 ... [<ffffffff814631c0>] ? rest_init+0x140/0x140 The issue is easily reproducible with a simple instrumentation: if mdelay(10) is put between mp_setup_entry() and mp_register_handler() calls in mp_irqdomain_alloc() Hyper-V guest always fails to boot when re-routing IRQ0. The issue seems to be caused by the fact that we don't disable interrupts while doing IOPIC programming for legacy IRQs and IRQ0 actually happens. Protect the setup sequence against concurrent interrupts. [ tglx: Make the protection unconditional and not only for legacy interrupts ] Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Link: http://lkml.kernel.org/r/1444930943-19336-1-git-send-email-vkuznets@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-16x86/setup: Extend low identity map to cover whole kernel rangePaolo Bonzini
On 32-bit systems, the initial_page_table is reused by efi_call_phys_prolog as an identity map to call SetVirtualAddressMap. efi_call_phys_prolog takes care of converting the current CPU's GDT to a physical address too. For PAE kernels the identity mapping is achieved by aliasing the first PDPE for the kernel memory mapping into the first PDPE of initial_page_table. This makes the EFI stub's trick "just work". However, for non-PAE kernels there is no guarantee that the identity mapping in the initial_page_table extends as far as the GDT; in this case, accesses to the GDT will cause a page fault (which quickly becomes a triple fault). Fix this by copying the kernel mappings from swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at identity mapping. For some reason, this is only reproducible with QEMU's dynamic translation mode, and not for example with KVM. However, even under KVM one can clearly see that the page table is bogus: $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize $ gdb (gdb) target remote localhost:1234 (gdb) hb *0x02858f6f Hardware assisted breakpoint 1 at 0x2858f6f (gdb) c Continuing. Breakpoint 1, 0x02858f6f in ?? () (gdb) monitor info registers ... GDT= 0724e000 000000ff IDT= fffbb000 000007ff CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690 ... The page directory is sane: (gdb) x/4wx 0x32b7000 0x32b7000: 0x03398063 0x03399063 0x0339a063 0x0339b063 (gdb) x/4wx 0x3398000 0x3398000: 0x00000163 0x00001163 0x00002163 0x00003163 (gdb) x/4wx 0x3399000 0x3399000: 0x00400003 0x00401003 0x00402003 0x00403003 but our particular page directory entry is empty: (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4 0x32b7070: 0x00000000 [ It appears that you can skate past this issue if you don't receive any interrupts while the bogus GDT pointer is loaded, or if you avoid reloading the segment registers in general. Andy Lutomirski provides some additional insight: "AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to point to unmapped memory. Any attempt to use them (segment loads, interrupts, IRET, etc) will try to access that memory as if the access came from CPL 0 and, if the access fails, will generate a valid page fault with CR2 pointing into the GDT or LDT." Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI calls, not in the epilog/prolog calls") interrupts were disabled around the prolog and epilog calls, and the functional GDT was re-installed before interrupts were re-enabled. Which explains why no one has hit this issue until now. ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Laszlo Ersek <lersek@redhat.com> Cc: <stable@vger.kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Matt Fleming <matt.fleming@intel.com> [ Updated changelog. ]
2015-10-15x86, ACPI: Handle apic/x2apic entries in MADT in correct orderLukasz Anaczkowski
ACPI specifies the following rules when listing APIC IDs: (1) Boot processor is listed first (2) For multi-threaded processors, BIOS should list the first logical processor of each of the individual multi-threaded processors in MADT before listing any of the second logical processors. (3) APIC IDs < 0xFF should be listed in APIC subtable, APIC IDs >= 0xFF should be listed in X2APIC subtable Because of above, when there's more than 0xFF logical CPUs, BIOS interleaves APIC/X2APIC subtables. Assuming, there's 72 cores, 72 hyper-threads each, 288 CPUs total, listing is like this: APIC (0,4,8, .., 252) X2APIC (258,260,264, .. 284) APIC (1,5,9,...,253) X2APIC (259,261,265,...,285) APIC (2,6,10,...,254) X2APIC (260,262,266,..,286) APIC (3,7,11,...,251) X2APIC (255,261,262,266,..,287) Now, before this patch, due to how ACPI MADT subtables were parsed (BSP then X2APIC then APIC), kernel enumerated CPUs in reverted order (i.e. high APIC IDs were getting low logical IDs, and low APIC IDs were getting high logical IDs). This is wrong for the following reasons: () it's hard to predict how cores and threads are enumerated () when it's hard to predict, s/w threads cannot be properly affinitized causing significant performance impact due to e.g. inproper cache sharing () enumeration is inconsistent with how threads are enumerated on other Intel Xeon processors So, order in which MADT APIC/X2APIC handlers are passed is reverse and both handlers are passed to be called during same MADT table to walk to achieve correct CPU enumeration. In scenario when someone boots kernel with options 'maxcpus=72 nox2apic', in result less cores may be booted, since some of the CPUs the kernel will try to use will have APIC ID >= 0xFF. In such case, one should not pass 'nox2apic'. Disclimer: code parsing MADT APIC/X2APIC has not been touched since 2009, when X2APIC support was initially added. I do not know why MADT parsing code was added in the reversed order in the first place. I guess it didn't matter at that time since nobody cared about cores with APIC IDs >= 0xFF, right? This patch is based on work of "Yinghai Lu <yinghai@kernel.org>" previously published at https://lkml.org/lkml/2013/1/21/563 Here's the explanation why parsing interface needs to be changed and why simpler approach will not work https://lkml.org/lkml/2015/9/7/285 Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> (commit message) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-10-14Merge tag 'efi-next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi Pull v4.4 EFI updates from Matt Fleming: - Make the EFI System Resource Table (ESRT) driver explicitly non-modular by ripping out the module_* code since Kconfig doesn't allow it to be built as a module anyway. (Paul Gortmaker) - Make the x86 efi=debug kernel parameter, which enables EFI debug code and output, generic and usable by arm64. (Leif Lindholm) - Add support to the x86 EFI boot stub for 64-bit Graphics Output Protocol frame buffer addresses. (Matt Fleming) - Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled in the firmware and set an efi.flags bit so the kernel knows when it can apply more strict runtime mapping attributes - Ard Biesheuvel - Auto-load the efi-pstore module on EFI systems, just like we currently do for the efivars module. (Ben Hutchings) - Add "efi_fake_mem" kernel parameter which allows the system's EFI memory map to be updated with additional attributes for specific memory ranges. This is useful for testing the kernel code that handles the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware doesn't include support. (Taku Izumi) Note: there is a semantic conflict between the following two commits: 8a53554e12e9 ("x86/efi: Fix multiple GOP device support") ae2ee627dc87 ("efifb: Add support for 64-bit frame buffer addresses") I fixed up the interaction in the merge commit, changing the type of current_fb_base from u32 to u64. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-13x86/early_printk: Set __iomem address space for IOAndy Shevchenko
There are following warnings on unpatched code: arch/x86/kernel/early_printk.c:198:32: warning: incorrect type in initializer (different address spaces) arch/x86/kernel/early_printk.c:198:32: expected void [noderef] <asn:2>*vaddr arch/x86/kernel/early_printk.c:198:32: got unsigned int [usertype] *<noident> arch/x86/kernel/early_printk.c:205:32: warning: incorrect type in initializer (different address spaces) arch/x86/kernel/early_printk.c:205:32: expected void [noderef] <asn:2>*vaddr arch/x86/kernel/early_printk.c:205:32: got unsigned int [usertype] *<noident> Annotate it proper. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: http://lkml.kernel.org/r/1444646837-42615-1-git-send-email-andriy.shevchenko@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-12x86/microcode/amd: Do not overwrite final patch levelsBorislav Petkov
A certain number of patch levels of applied microcode should not be overwritten by the microcode loader, otherwise bad things will happen. Check those and abort update if the current core has one of those final patch levels applied by the BIOS. 32-bit needs special handling, of course. See https://bugzilla.suse.com/show_bug.cgi?id=913996 for more info. Tested-by: Peter Kirchgeßner <pkirchgessner@t-online.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1444641762-9437-7-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12x86/microcode/amd: Extract current patch level read to a functionBorislav Petkov
Pave the way for checking the current patch level of the microcode in a core. We want to be able to do stuff depending on the patch level - in this case decide whether to update or not. But that will be added in a later patch. Drop unused local var uci assignment, while at it. Integrate a fix for 32-bit and CONFIG_PARAVIRT from Takashi Iwai: Use native_rdmsr() in check_current_patch_level() because with CONFIG_PARAVIRT enabled and on 32-bit, where we run before paging has been enabled, we cannot deref pv_info yet. Or we could, but we'd need to access its physical address. This way of fixing it is simpler. See: https://bugzilla.suse.com/show_bug.cgi?id=943179 for the background. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Takashi Iwai <tiwai@suse.com>: Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1444641762-9437-6-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12efi: Add "efi_fake_mem" boot optionTaku Izumi
This patch introduces new boot option named "efi_fake_mem". By specifying this parameter, you can add arbitrary attribute to specific memory range. This is useful for debugging of Address Range Mirroring feature. For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000" is specified, the original (firmware provided) EFI memmap will be updated so that the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute (0x10000): <original> efi: mem36: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB) <updated> efi: mem36: [Conventional Memory| |MR| | | | |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB) efi: mem37: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB) efi: mem38: [Conventional Memory| |MR| | | | |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB) efi: mem39: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB) And you will find that the following message is output: efi: Memory: 4096M/131455M mirrored memory Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-10-12Merge branch 'x86/ras' into ras/core, to pick up changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12x86/kexec: Remove obsolete 'in_crash_kexec' flagMinfei Huang
Previously, UV NMI used the 'in_crash_kexec' flag to determine whether we are in a kdump kernel or not: 5edd19af18a36a4 ("x86, UV: Make kdump avoid stack dumps") But this flags was removed in the following commit: 9c48f1c629ecfa1 ("x86, nmi: Wire up NMI handlers to new routines") Since it isn't used any more, remove it. Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cpw@sgi.com Cc: kexec@lists.infradead.org Cc: mhuang@redhat.com Link: http://lkml.kernel.org/r/1444070155-17934-1-git-send-email-mhuang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-11x86/io_apic: Make eoi_ioapic_pin() staticAndy Shevchenko
We have to define internally used function as static, otherwise the following warning will be generated: arch/x86/kernel/apic/io_apic.c:532:6: warning: no previous prototype for 'eoi_ioapic_pin' [-Wmissing-prototypes] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com> Link: http://lkml.kernel.org/r/1444400685-98611-1-git-send-email-andriy.shevchenko@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-09x86/asm: Remove thread_info.sysenter_returnAndy Lutomirski
It's no longer needed. We could reinstate something like it as an optimization, which would remove two cachelines from the fast syscall entry working set. I benchmarked it, and it makes no difference whatsoever to the performance of cache-hot compat syscalls on Sandy Bridge. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/f08cc0cff30201afe9bb565c47134c0a6c1a96a2.1444091585.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-08Merge branch 'perf/urgent' into perf/core, before pulling new changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-07x86/vdso: Remove runtime 32-bit vDSO selectionAndy Lutomirski
32-bit userspace will now always see the same vDSO, which is exactly what used to be the int80 vDSO. Subsequent patches will clean it up and make it support SYSENTER and SYSCALL using alternatives. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/e7e6b3526fa442502e6125fe69486aab50813c32.1444091584.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06perf/x86/intel/uncore: Fix multi-segment problem of perf_event_intel_uncoreTaku Izumi
In multi-segment system, uncore devices may belong to buses whose segment number is other than 0: .... 0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03) ... 0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03) ... 0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03) ... 0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03 ... In that case, relation of bus number and physical id may be broken because "uncore_pcibus_to_physid" doesn't take account of PCI segment. For example, bus 0000:ff and 0001:ff uses the same entry of "uncore_pcibus_to_physid" array. This patch fixes this problem by introducing the segment-aware pci2phy_map instead. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1443096621-4119-1-git-send-email-izumi.taku@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06perf/x86: Add Intel cstate PMUs supportKan Liang
This patch adds new PMUs to support cstate related free running (read-only) counters. These counters may be used simultaneously by other tools, such as turbostat. However, it still make sense to implement them in perf. Because we can conveniently collect them together with other events, and allow to use them from tools without special MSR access code. These counters include CORE_C*_RESIDENCY and PKG_C*_RESIDENCY. According to counters' scope and category, two PMUs are registered with the perf_event core subsystem. - 'cstate_core': The counter is available for each physical core. The counters include CORE_C*_RESIDENCY. - 'cstate_pkg': The counter is available for each physical package. The counters include PKG_C*_RESIDENCY. The events are exposed in sysfs for use by perf stat and other tools. The files are: /sys/devices/cstate_core/events/c*-residency /sys/devices/cstate_pkg/events/c*-residency These events only support system-wide mode counting. The /sys/devices/cstate_*/cpumask file can be used by tools to figure out which CPUs to monitor by default. The PMU type (attr->type) is dynamically allocated and is available from /sys/devices/core_misc/type and /sys/device/cstate_*/type. Sampling is not supported. Here is an example. - To caculate the fraction of time when the core is running in C6 state CORE_C6_time% = CORE_C6_RESIDENCY / TSC # perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 sleep 5 11838820015,,cstate_core/c6-residency/,5175919658,100.00 11877130740,,msr/tsc/,5175922010,100.00 For sleep, 99.7% of time we ran in C6 state. # perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 busyloop 1253316,,cstate_core/c6-residency/,4360969154,100.00 10012635248,,msr/tsc/,4360972366,100.00 For busyloop, 0.01% of time we ran in C6 state. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: eranian@google.com Link: http://lkml.kernel.org/r/1443443404-8581-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core, sched/x86: Kill thread_info::saved_preempt_countPeter Zijlstra
With the introduction of the context switch preempt_count invariant, and the demise of PREEMPT_ACTIVE, its pointless to save/restore the per-cpu preemption count, it must always be 2. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-03Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Fixes all around the map: W+X kernel mapping fix, WCHAN fixes, two build failure fixes for corner case configs, x32 header fix and a speling fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds x86/mm: Set NX on gap between __ex_table and rodata x86/kexec: Fix kexec crash in syscall kexec_file_load() x86/process: Unify 32bit and 64bit implementations of get_wchan() x86/process: Add proper bound checks in 64bit get_wchan() x86, efi, kasan: Fix build failure on !KASAN && KMEMCHECK=y kernels x86/hyperv: Fix the build in the !CONFIG_KEXEC_CORE case x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag
2015-10-02x86/kexec: Fix kexec crash in syscall kexec_file_load()Lee, Chun-Yi
The original bug is a page fault crash that sometimes happens on big machines when preparing ELF headers: BUG: unable to handle kernel paging request at ffffc90613fc9000 IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260 The bug is caused by us under-counting the number of memory ranges and subsequently not allocating enough ELF header space for them. The bug is typically masked on smaller systems, because the ELF header allocation is rounded up to the next page. This patch modifies the code in fill_up_crash_elf_data() by using walk_system_ram_res() instead of walk_system_ram_range() to correctly count the max number of crash memory ranges. That's because the walk_system_ram_range() filters out small memory regions that reside in the same page, but walk_system_ram_res() does not. Here's how I found the bug: After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(), the code uses walk_system_ram_res() to fill-in crash memory regions information to the program header, so it counts those small memory regions that reside in a page area. But, when the kernel was using walk_system_ram_range() in fill_up_crash_elf_data() to count the number of crash memory regions, it filters out small regions. I printed those small memory regions, for example: kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0 Based on the code in walk_system_ram_range(), this memory region will be filtered out: pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593 end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593 end_pfn - pfn = 0x77593 - 0x77593 = 0 <=== if (end_pfn > pfn) is FALSE So, the max_nr_ranges that's counted by the kernel doesn't include small memory regions - causing us to under-allocate the required space. That causes the page fault crash that happens in a later code path when preparing ELF headers. This bug is not easy to reproduce on small machines that have few CPUs, because the allocated page aligned ELF buffer has more free space to cover those small memory regions' PT_LOAD headers. Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-01x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERORadim Krčmář
Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore. The purpose of that flags was to start counting system time from 0 when the KVM clock has been initialized. We can achieve the same by selecting one read as the initial point. A simple subtraction will work unless the KVM clock count overflows earlier (has smaller width) than scheduler's cycle count. We should be safe till x86_128. Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors, setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might regress on older ones. I presume we don't need to change kvm_clock_read instead of introducing kvm_sched_clock_read. A problem could arise in case sched_clock is expected to return the same value as get_cycles, but we should have merged those clocks in that case. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01x86/irq: Drop unlikely before IS_ERR_OR_NULLGeliang Tang
IS_ERR_OR_NULL already contain an unlikely compiler flag. Drop it. Signed-off-by: Geliang Tang <geliangtang@163.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/03d18502ed7ed417f136c091f417d2d88c147ec6.1443667610.git.geliangtang@163.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-01perf/x86/intel/uncore: Do not use macro DEFINE_PCI_DEVICE_TABLE()Vaishali Thakkar
The DEFINE_PCI_DEVICE_TABLE() macro is deprecated. Use 'struct pci_device_id' instead of DEFINE_PCI_DEVICE_TABLE(), with the goal of getting rid of this macro completely. This Coccinelle semantic patch performs this transformation: @@ identifier a; declarer name DEFINE_PCI_DEVICE_TABLE; initializer i; @@ - DEFINE_PCI_DEVICE_TABLE(a) + const struct pci_device_id a[] = i; Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20151001085201.GA16939@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-30x86/signal: Deinline get_sigframe, save 240 bytesDenys Vlasenko
This function compiles to 277 bytes of machine code and has 4 callsites. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Link: http://lkml.kernel.org/r/1443443037-22077-4-git-send-email-dvlasenk@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30x86: Deinline early_console_register, save 403 bytesDenys Vlasenko
This function compiles to 60 bytes of machine code. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Link: http://lkml.kernel.org/r/1443443037-22077-3-git-send-email-dvlasenk@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30x86/e820: Deinline e820_type_to_string, save 126 bytesDenys Vlasenko
This function compiles to 102 bytes of machine code. It has two callsites. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Link: http://lkml.kernel.org/r/1443443037-22077-2-git-send-email-dvlasenk@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30x86/process: Unify 32bit and 64bit implementations of get_wchan()Thomas Gleixner
The stack layout and the functionality is identical. Use the 64bit version for all of x86. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Link: http://lkml.kernel.org/r/20150930083302.779694618@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30x86/process: Add proper bound checks in 64bit get_wchan()Thomas Gleixner
Dmitry Vyukov reported the following using trinity and the memory error detector AddressSanitizer (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel). [ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on address ffff88002e280000 [ 124.576801] ffff88002e280000 is located 131938492886538 bytes to the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a) [ 124.578633] Accessed by thread T10915: [ 124.579295] inlined in describe_heap_address ./arch/x86/mm/asan/report.c:164 [ 124.579295] #0 ffffffff810dd277 in asan_report_error ./arch/x86/mm/asan/report.c:278 [ 124.580137] #1 ffffffff810dc6a0 in asan_check_region ./arch/x86/mm/asan/asan.c:37 [ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0 [ 124.581893] #3 ffffffff8107c093 in get_wchan ./arch/x86/kernel/process_64.c:444 The address checks in the 64bit implementation of get_wchan() are wrong in several ways: - The lower bound of the stack is not the start of the stack page. It's the start of the stack page plus sizeof (struct thread_info) - The upper bound must be: top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long). The 2 * sizeof(unsigned long) is required because the stack pointer points at the frame pointer. The layout on the stack is: ... IP FP ... IP FP. So we need to make sure that both IP and FP are in the bounds. Fix the bound checks and get rid of the mix of numeric constants, u64 and unsigned long. Making all unsigned long allows us to use the same function for 32bit as well. Use READ_ONCE() when accessing the stack. This does not prevent a concurrent wakeup of the task and the stack changing, but at least it avoids TOCTOU. Also check task state at the end of the loop. Again that does not prevent concurrent changes, but it avoids walking for nothing. Add proper comments while at it. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Based-on-patch-from: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30Merge branch 'x86/for-kvm' into x86/apicThomas Gleixner
Pull in the apic change which is provided for kvm folks to pull into their tree.