Bug 679826 - app-emulation/xen seldom boots -- hangs on masked ExtINT on CPU#[various number]
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Xen Devs
URL:
Whiteboard:
Keywords:
Duplicates: 680472
Depends on:
Blocks:
 
Reported: 2019-03-09 06:00 UTC by John L. Poole
Modified: 2019-09-26 04:40 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Attachments
Serial Console Log March 3 - March 8 of many boots (app-emulation%3Axen-tools-4.12.0_rc4%3A20190309-030937.log.zip,55.78 KB, application/x-zip-compressed)
2019-03-09 06:00 UTC, John L. Poole
archive of grub & Linux .config (gentoo_bug_679824_addl.tar.bz2,39.20 KB, application/x-bzip)
2019-03-09 17:45 UTC, John L. Poole
Crib Notes For Making/Deploying Patch (making_debug_patch.txt,2.30 KB, text/plain)
2019-03-13 01:41 UTC, John L. Poole
Debug Patch with JLPOOLEDEBUG (JLPOOLE_201903121838_xen_4.11.1-r1.patch,5.96 KB, patch)
2019-03-13 01:43 UTC, John L. Poole
output of dmidecode (637 lines) (dmidecode.log,21.06 KB, text/plain)
2019-03-13 02:07 UTC, John L. Poole
lspci -vvv output (513 lines) (201903121901_lspci_vvv.log,34.54 KB, text/plain)
2019-03-13 02:07 UTC, John L. Poole
Xen Console All Diagnostics [ key '*' (ascii '2a') => print all diagnostics] (Xen_Console_diagnostics_all_before_watchdog_reboot.log,73.36 KB, text/plain)
2019-03-13 02:21 UTC, John L. Poole
Boot Log (Unsuccessful) from EFI Console (efi_boot_2019)0316_2139.log,9.67 KB, text/plain)
2019-03-17 05:28 UTC, John L. Poole

Description John L. Poole 2019-03-09 06:00:13 UTC
Created attachment 568246 [details]
Serial Console Log March 3 - March 8 of many boots

I've been wrestling with getting Xen to boot on my Intel Atom.

At boot of the Xen kernel, the system often hangs at the point where ExtINT is being masked on the various CPUs.  The stopping point appears to be random, and sometimes I get lucky, make it past the ExtINT stage, and the Xen kernel boots successfully.  (I still do not get a login: prompt and have to access the instance via ssh.)  What is novel here is that the hang can occur after "masked ExtINT" on CPU#1 or #2 or #3, etc.  There is no pattern.

I'm attaching my serial console log started March 3, 2019, which has been in APPEND mode since then.

From my Serial console log started  2019.03.03 09:51:00

Some later points where boot of kernel hung in chronological order:

Line#   Entry
44254   (XEN) [2019-03-09 05:29:44] masked ExtINT on CPU#1
reboot
44448   (XEN) [2019-03-09 05:31:38] masked ExtINT on CPU#1
reboot
44644   (XEN) [2019-03-09 05:32:59] masked ExtINT on CPU#3
reboot
44867   (XEN) [2019-03-09 05:34:26] masked ExtINT on CPU#3
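
The hang points listed above could be pulled out of the serial console log automatically. The following is only a minimal sketch, assuming that every boot in the log begins with the "(XEN) Xen version" banner seen in the excerpts in this report; the function name is hypothetical and not part of any Xen tool.

```python
import re

# Patterns taken from the log excerpts in this report.
MASKED = re.compile(r"masked ExtINT on CPU#(\d+)")
BANNER = re.compile(r"\(XEN\) Xen version ")


def last_masked_cpu_per_boot(lines):
    """For each boot (delimited by the Xen version banner), return the
    number of the last CPU that logged 'masked ExtINT', or None."""
    results = []
    current = None
    in_boot = False
    for line in lines:
        if BANNER.search(line):
            if in_boot:
                results.append(current)  # close the previous boot
            in_boot = True
            current = None
            continue
        match = MASKED.search(line)
        if match and in_boot:
            current = int(match.group(1))
    if in_boot:
        results.append(current)  # close the final boot
    return results
```

On this 8-CPU machine a boot that gets through the stage would be expected to end on CPU#7, so a smaller number hints at a hang at that point; the sketch does not prove a hang by itself.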

Note, I can successfully boot into a normal kernel:
zeta / # uname -a
Linux zeta 4.19.23-gentoo #8 SMP Mon Mar 4 20:48:52 PST 2019 x86_64 Intel(R) Atom(TM) CPU C2750 @ 2.40GHz GenuineIntel GNU/Linux
zeta / #

I have another bug pending for app-emulation/xen-tools-4.12.0_rc4 in Bug #679824.  I can provide my /boot/grub/grub.cfg, /etc/default/grub, and my kernel /usr/src/linux/.config.  Just let me know what you'd like.

I posted to the Xen Users mailing list and received no response.  See:
https://lists.xenproject.org/archives/html/xen-users/2019-03/msg00006.html
https://lists.xenproject.org/archives/html/xen-users/2019-03/msg00018.html

I direct your attention to my msg00018 posting, where I report a possible theory: a spurious interrupt.

My postings on the Xen Users list also identify where in the Xen code the problem may be occurring.  I realize this may really be an issue within Xen, but I thought I'd start here just for the record.
Comment 1 Tomáš Mózes 2019-03-09 11:07:27 UTC
So this is a new installation where no Xen version works.

Have you tried legacy BIOS boot?
Disabling Xen security features? xpti..

Does another OS with Xen work on it?
Comment 2 John L. Poole 2019-03-09 17:43:57 UTC
Responding to question of  Tomáš Mózes 2019-03-09 11:07:27 UTC :
1) This is an upgrade installation, not a new installation.  I purchased a Supermicro Intel Atom based unit in October 2016 and then undertook to install Xen.  I had lots of problems trying to use Gentoo as the DOM0.  I did successfully install using Debian, the distribution then recommended by the Xen Project on their wiki.  But I wanted Gentoo, and I wanted to help clear the way for others desiring Gentoo instead of Debian.  I ran into problems; for instance, see Gentoo Bug #601872 "xen.gz Kernel Load And Hangs".  My attempts around December 2016 revealed there was a bug in binutils or something re: "COFF and never PE" - see https://lists.xenproject.org/archives/html/xen-devel/2016-12/msg00815.html. (Jan Beulich ?)  I got around that issue by patching and/or using a patched package, then I ran into another problem with Grub.  I ended up having a dialog with the person who seemed to specialize in Xen and Grub (and who also worked for my employer in another division), who advised that Grub was not quite ready for launching Xen and might not be until Spring 2017.  So I cut my losses and adopted an EFI console procedure where I manually loaded the kernel.  That manual procedure served me reliably.  Xen 4.7.1 or thereabouts was the version I was using.  I do not recall ever running into the masked ExtINT issue then.

From December 2018 through now, I have been trying to get Grub to work, assuming the milestone of Grub loading Xen had been reached.

Specifications of this server are:
Product SKU: SuperServer 5018A-TN4 (Black)
Motherboard: Super A1SAi-2750F
Processor/Cache: 
    CPU
    Intel® Atom® Processor C2750
    CPU TDP 20W (8-Core)
    FCBGA 1283
    System-on-Chip
System Memory:
4x 204-pin DDR3 SO-DIMM slots
Supports up to 64GB DDR3 ECC memory

From the sale Quote 11/1/2016:
SYS-5018A-TN4-OTO-50
--OPTIMIZED SYS-5018A-TN4(x1)A1SAi-2750F, 504-203B
--MEM-DR316L-CL02-ES16(x4)16GB DDR3-1600 1.35V 2RX8 ECC SODIMM
--HDD-T4000-MG04ACA400E(x1)[NR]Toshiba 3.5" 4TB SATA 6Gb/s 7.2K
RPM 128M 512E

2) I have not tried legacy BIOS.  I recall looking into this option and learning that "legacy BIOS" is just a mode UEFI runs in to simulate BIOS.  Had I the option of truly replacing UEFI with BIOS, I think I would have chosen to go with legacy BIOS.

3) I have not tried disabling XPTI.  I do not know what that is, but I'll look into it, give it a try, and update this bug with my findings.

4) Yes, Debian in 2016 worked, I was able to boot into DOM0 without incident.
Comment 3 John L. Poole 2019-03-09 17:45:42 UTC
Created attachment 568328 [details]
archive of grub & Linux .config

Attaching more information:
zeta /home/jlpoole/gentoobugs/679824 # tar -cjvf gentoo_bug_679824_addl.tar.bz2 addl
addl/
addl/default_grub_201903090841
addl/boot_grub_grub_201903090842.cfg
addl/linux_201903090840.config
zeta /home/jlpoole/gentoobugs/679824 #
Comment 4 John L. Poole 2019-03-09 17:55:45 UTC
I tried modifying my command line by adding:
xpti=false

per https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html#xpti-x86  which provides:

1.2.186 xpti (x86)

    = List of [ default | <boolean> | dom0=<bool> | domu=<bool> ]

    Default: false on hardware known not to be vulnerable to Meltdown (e.g. AMD) Default: true everywhere else

Override default selection of whether to isolate 64-bit PV guest page tables.

true activates page table isolation even on hardware not vulnerable by Meltdown for all domains.

false deactivates page table isolation on all systems for all domains.

default sets the default behaviour.

With dom0 and domu it is possible to control page table isolation for dom0 or guest domains only.
 

Here are snippets from my log of my attempt:

(XEN) Command line: placeholder vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=verbose xpti=false no-real-mode edd=off
...
(XEN) [2019-03-09 17:52:09] HVM: HAP page sizes: 4kB, 2MB
(XEN) [2019-03-09 17:52:05] masked ExtINT on CPU#1
(XEN) [2019-03-09 17:52:05] masked ExtINT on CPU#2
[HUNG]
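
Since several boot options are being toggled between attempts, it can help to confirm what Xen actually received. Here is a small sketch that splits a "(XEN) Command line:" log entry into a dictionary; the function name is hypothetical, and treating valueless tokens such as "no-real-mode" as True is my own convention, not Xen's.

```python
def parse_xen_cmdline(log_line):
    """Split a '(XEN) Command line: ...' log entry into {option: value}.

    Tokens without '=' (e.g. 'no-real-mode') are stored as True.
    Returns an empty dict if the line is not a command-line entry.
    """
    _, _, rest = log_line.partition("Command line:")
    options = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options
```

For the entry above, `parse_xen_cmdline(line)["xpti"]` gives `"false"`, and the presence of `"no-real-mode"` and `"edd"` shows that GRUB appended options beyond the ones configured by hand.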
Comment 5 John L. Poole 2019-03-11 03:24:53 UTC
I have tried several times to start DOM0, and when I finally get past the masking of ExtINT, I'm able to log in remotely via ssh.  The console on the server and the serial port do not show a login prompt.  The last entry on the serial console is "* Starting local ..." followed by "[ ok ]".  Then nothing more appears on the serial console until I perform a shutdown from another console.  But that's a minor point.  Aside from that, I was configuring a bridge when my ssh session hung.  The serial console showed this error message:

* Starting local ...
[ ok ]
[  220.213278] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
(XEN) [2019-03-11 03:16:13] Watchdog timer fired for domain 0
(XEN) [2019-03-11 03:16:13] Hardware Dom0 shutdown: watchdog rebooting machine

Lastly, I cannot say definitively, but, if I go through the following sequence, I seem to have better luck getting a successful launch of the kernel:

In grub, selecting either of the XEN menus, then clicking "e" to edit.  
Then scroll a line or two.
Then ctrl-X.
Hit return for the default menu entries appearing thereafter, two of them. 
Then the (XEN) reporting starts.
Comment 6 John L. Poole 2019-03-11 12:17:31 UTC
Tomas Mozes brought to my attention this thread where Juergen Gross-3 on Jan 11, 2019 suggested setting "pcid=false": http://xen.1045712.n5.nabble.com/xen-domU-segfaults-with-xpti-on-intel-based-systems-td5744423.html


So I tried adding "pcid=false" and the boot still hangs around the same place:

    Booting a command listBooting a command list



Loading Xen xen ...Loading Xen xen ...

WARNING: no console will be available to OSWARNING: no console will be available to OS

Loading Linux x86_64-4.19.23-gentoo ...Loading Linux x86_64-4.19.23-gentoo ...

Loading initial ramdisk ...Loading initial ramdisk ...

error: no suitable video mode found.
error: no suitable video mode found.
 Xen 4.11.1
(XEN) Xen version 4.11.1 (@[unknown]) (x86_64-pc-linux-gnu-gcc (Gentoo 7.3.0-r3 p1.4) 7.3.0) debug=n  Wed Mar  6 19:34:00 PST 2019
(XEN) Latest ChangeSet:
(XEN) Console output is synchronous.
(XEN) Bootloader: GRUB 2.02
(XEN) Command line: placeholder pcid=false vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=verbose xpti=false no-real-mode edd=off
(XEN) Xen image load base address: 0
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN) Disc information:
(XEN)  Found 0 MBR signatures
(XEN)  Found 0 EDD information structures
(XEN) Multiboot-e820 RAM map:
(XEN)  0000000000000000 - 00000000000a0000 (usable)
(XEN)  0000000000100000 - 000000007e16d000 (usable)
(XEN)  000000007e16d000 - 000000007eba4000 (reserved)
(XEN)  000000007eba4000 - 000000007ed12000 (usable)
(XEN)  000000007ed12000 - 000000007f28d000 (ACPI NVS)
(XEN)  000000007f28d000 - 000000007f5f3000 (reserved)
(XEN)  000000007f5f3000 - 000000007f648000 type 20
(XEN)  000000007f648000 - 000000007f800000 (usable)
(XEN)  00000000e0000000 - 00000000e4000000 (reserved)
(XEN)  00000000fed01000 - 00000000fed04000 (reserved)
(XEN)  00000000fed08000 - 00000000fed09000 (reserved)
(XEN)  00000000fed0c000 - 00000000fed10000 (reserved)
(XEN)  00000000fed1c000 - 00000000fed1d000 (reserved)
(XEN)  00000000fef00000 - 00000000ff000000 (reserved)
(XEN)  00000000ff800000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000ff0000000 (usable)
(XEN) New Xen image base address: 0x7da00000
(XEN) ACPI Error (tbxfroot-0217): A valid RSDP was not found [20070126]
(XEN) System RAM: 63204MB (64721100kB)
(XEN) No NUMA configuration found
(XEN) Faking a node at 0000000000000000-0000000ff0000000
(XEN) Domain heap initialised
(XEN) Allocated console ring of 64 KiB.
(XEN) CPU Vendor: Intel, Family 6 (0x6), Model 77 (0x4d), Stepping 8 (raw 000406d8)
(XEN) found SMP MP-table at 000fd8a0
(XEN) DMI 2.7 present.
(XEN) Using APIC driver default
(XEN) Intel MultiProcessor Specification v1.4
(XEN)     Virtual Wire compatibility mode.
(XEN) OEM ID: A M I Product ID: ALASKA APIC at: 0xfee00000
(XEN) Processor #00 6:13 APIC version 20
(XEN) Processor #02 6:13 APIC version 20
(XEN) Processor #04 6:13 APIC version 20
(XEN) Processor #06 6:13 APIC version 20
(XEN) Processor #08 6:13 APIC version 20
(XEN) Processor #0a 6:13 APIC version 20
(XEN) Processor #0c 6:13 APIC version 20
(XEN) Processor #0e 6:13 APIC version 20
(XEN) I/O APIC #2 Version 32 at 0xfec00000.
(XEN) Enabling APIC mode:  Flat.  Using 1 I/O APICs
(XEN) Processors: 8
(XEN) SMP: Allowing 8 CPUs (0 hotplug CPUs)
(XEN) mapped APIC to ffff82cfffffb000 (fee00000)
(XEN) mapped IOAPIC to ffff82cfffffa000 (fec00000)
(XEN) IRQ limits: 24 GSI, 1528 MSI/MSI-X
(XEN) CPU0: Intel machine check reporting enabled
(XEN) Unrecognised CPU model 0x4d - assuming not reptpoline safe
(XEN) Speculative mitigation facilities:
(XEN)   Hardware features:
(XEN)   Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: No, Other:
(XEN)   Support for VMs: PV: RSB, HVM: RSB
(XEN)   XPTI (64-bit PV only): Dom0 disabled, DomU disabled
(XEN)   PV L1TF shadowing: Dom0 disabled, DomU disabled
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Platform timer is 1.193MHz PIT
(XEN) Detected 2400.052 MHz processor.
(XEN) Initing memory sharing.
(XEN) alt table ffff82d08042a838 -> ffff82d08042c5ce
(XEN) I/O virtualisation disabled
(XEN) nr_sockets: 1
(XEN) enabled ExtINT on CPU#0
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
(XEN) init IO_APIC IRQs
(XEN)  IO-APIC (apicid-pin) 2-0, 2-6, 2-7, 2-10, 2-11, 2-12, 2-15 not connected.
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1
(XEN) number of MP IRQ sources: 39.
(XEN) number of IO-APIC #2 registers: 24.
(XEN) testing the IO APIC.......................
(XEN) IO APIC #2......
(XEN) .... register #00: 02000000
(XEN) .......    : physical APIC id: 02
(XEN) .......    : Delivery Type: 0
(XEN) .......    : LTS          : 0
(XEN) .... register #01: 00170020
(XEN) .......     : max redirection entries: 0017
(XEN) .......     : PRQ implemented: 0
(XEN) .......     : IO APIC version: 0020
(XEN) .... IRQ redirection table:
(XEN)  NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
(XEN)  00 000 00  1    0    0   0   0    0    0    00
(XEN)  01 001 01  0    0    0   0   0    1    1    28
(XEN)  02 001 01  0    0    0   0   0    1    1    F0
(XEN)  03 001 01  0    0    0   0   0    1    1    30
(XEN)  04 001 01  1    0    0   0   0    1    1    F1
(XEN)  05 001 01  0    0    0   0   0    1    1    38
(XEN)  06 000 00  1    0    0   0   0    0    0    00
(XEN)  07 000 00  1    0    0   0   0    0    0    00
(XEN)  08 001 01  0    0    0   0   0    1    1    40
(XEN)  09 001 01  1    1    0   0   0    1    1    48
(XEN)  0a 000 00  1    0    0   0   0    0    0    00
(XEN)  0b 000 00  1    0    0   0   0    0    0    00
(XEN)  0c 000 00  1    0    0   0   0    0    0    00
(XEN)  0d 001 01  0    0    0   0   0    1    1    50
(XEN)  0e 001 01  0    0    0   0   0    1    1    58
(XEN)  0f 000 00  1    0    0   0   0    0    0    00
(XEN)  10 001 01  1    1    0   1   0    1    1    60
(XEN)  11 001 01  1    1    0   1   0    1    1    68
(XEN)  12 001 01  1    1    0   1   0    1    1    70
(XEN)  13 001 01  1    1    0   1   0    1    1    78
(XEN)  14 001 01  1    1    0   1   0    1    1    88
(XEN)  15 001 01  1    1    0   1   0    1    1    90
(XEN)  16 001 01  1    1    0   1   0    1    1    98
(XEN)  17 001 01  1    1    0   1   0    1    1    A0
(XEN) Using vector-based indexing
(XEN) IRQ to pin mappings:
(XEN) IRQ240 -> 0:2
(XEN) IRQ40 -> 0:1
(XEN) IRQ48 -> 0:3
(XEN) IRQ241 -> 0:4
(XEN) IRQ56 -> 0:5
(XEN) IRQ64 -> 0:8
(XEN) IRQ72 -> 0:9
(XEN) IRQ80 -> 0:13
(XEN) IRQ88 -> 0:14
(XEN) IRQ96 -> 0:16
(XEN) IRQ104 -> 0:17
(XEN) IRQ112 -> 0:18
(XEN) IRQ120 -> 0:19
(XEN) IRQ136 -> 0:20
(XEN) IRQ144 -> 0:21
(XEN) IRQ152 -> 0:22
(XEN) IRQ160 -> 0:23
(XEN) .................................... done.
(XEN) Using local APIC timer interrupts.
(XEN) calibrating APIC timer ...
(XEN) ..... CPU clock speed is 2400.0484 MHz.
(XEN) ..... host bus clock speed is 100.0019 MHz.
(XEN) ..... bus_scale = 0x6669
(XEN) TSC deadline timer enabled
(XEN) [2019-03-11 11:54:13] mwait-idle: MWAIT substates: 0x3000020
(XEN) [2019-03-11 11:54:13] mwait-idle: v0.4.1 model 0x4d
(XEN) [2019-03-11 11:54:13] mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) [2019-03-11 11:54:13] VMX: Supported advanced features:
(XEN) [2019-03-11 11:54:13]  - APIC MMIO access virtualisation
(XEN) [2019-03-11 11:54:13]  - APIC TPR shadow
(XEN) [2019-03-11 11:54:13]  - Extended Page Tables (EPT)
(XEN) [2019-03-11 11:54:13]  - Virtual-Processor Identifiers (VPID)
(XEN) [2019-03-11 11:54:13]  - Virtual NMI
(XEN) [2019-03-11 11:54:13]  - MSR direct-access bitmap
(XEN) [2019-03-11 11:54:13]  - Unrestricted Guest
(XEN) [2019-03-11 11:54:13]  - VM Functions
(XEN) [2019-03-11 11:54:13] HVM: ASIDs enabled.
(XEN) [2019-03-11 11:54:13] HVM: VMX enabled
(XEN) [2019-03-11 11:54:13] HVM: Hardware Assisted Paging (HAP) detected
(XEN) [2019-03-11 11:54:13] HVM: HAP page sizes: 4kB, 2MB
(XEN) [2019-03-11 11:54:09] masked ExtINT on CPU#1
(XEN) [2019-03-11 11:54:09] masked ExtINT on CPU#2

[HUNG]

Also, my perceived routine of going into edit mode, making no edits, invoking the command with Ctrl-X, and subsequently accepting the default menu entries for XEN does not seem to make a difference in whether I get past the "masked ExtINT" issue.

Lastly, I am still able to boot a regular Gentoo kernel and run in non-Xen mode without incident.  I was beginning to wonder if I have a hardware issue, but my ability to launch a regular Gentoo kernel suggests the problem is something in the Xen kernel that is not properly handling a CPU interrupt, or whatever is going on in apic.c's function setup_local_APIC(void): https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/apic.c;h=2a2432619e3edce2cdbc275abbd4e80ffcdcd9f0;hb=HEAD#l524
Comment 7 John L. Poole 2019-03-11 12:50:52 UTC
For anyone following this bug and wanting to learn more about interrupts, here is an explanation of masking interrupts from the IA-32 Intel® Architecture Software Developer’s Manual (dated 2001), a copy of which is at:
https://www.cs.cmu.edu/~410/doc/intel-sys.pdf


5.1.1.2. MASKABLE HARDWARE INTERRUPTS
Any external interrupt that is delivered to the processor by means of the INTR pin or through
the local APIC is called a maskable hardware interrupt. The maskable hardware interrupts
that can be delivered through the INTR pin include all IA-32 architecture defined interrupt
vectors from 0 through 255; those that can be delivered through the local APIC include interrupt
vectors 16 through 255.  [sheet 140]

Sheet 146 of the IA-32 Intel® Architecture Software Developer’s Manual has "5.6.1 Masking Maskable Hardware Interrupts"

I guess I'll explore the BIOS settings for my processor and see if there is any configuration which affects the handling of interrupts and/or UEFI.
Comment 8 John L. Poole 2019-03-11 16:09:31 UTC
I altered a setting in BIOS:
Extended APIC from enabled to disabled
Boot up hung after:
(XEN) [2019-03-11 13:04:00] HVM: HAP page sizes: 4kB, 2MB
(XEN) [2019-03-11 13:03:56] masked ExtINT on CPU#1
Comment 9 John L. Poole 2019-03-11 18:06:59 UTC
I tried setting apic_verbosity to its other setting of "debug" to see if any more information is output around the hanging point.

Conclusion: no difference between "verbose" and "debug" for apic_verbosity.

Here are my two attempts (for the second I removed the xpti parameter) and their final output:

(XEN) Command line: placeholder vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=debug xpti=false no-real-mode edd=off


(XEN) [2019-03-11 18:00:42] HVM: HAP page sizes: 4kB, 2MB
(XEN) [2019-03-11 18:00:38] masked ExtINT on CPU#1
(XEN) [2019-03-11 18:00:38] masked ExtINT on CPU#2


(XEN) Command line: placeholder vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=debug no-real-mode edd=off


(XEN) [2019-03-11 18:02:58] HVM: ASIDs enabled.
(XEN) [2019-03-11 18:02:58] HVM: VMX enabled
(XEN) [2019-03-11 18:02:58] HVM: Hardware Assisted Paging (HAP) detected
(XEN) [2019-03-11 18:02:58] HVM: HAP page sizes: 4kB, 2MB
Comment 10 John L. Poole 2019-03-11 18:13:36 UTC
I propose creating a patch for the kernel code which provides more details of events leading up to the hang.  I feel competent to insert print statements after important events in various *.c files, e.g. apic.c and setup.c.  It's been years since I've done something like this.  What I intend to do is create a custom copy of the ebuild under /usr/local/portage... and keep the patches in a subdirectory.


The problem I am encountering looks like something that should be of interest to the Xen code maintainers.  I'd like to make it as easy as possible for them to focus on this so a resolution or analysis can be made.

Suggestions?  Comments?
Comment 11 John L. Poole 2019-03-12 05:26:17 UTC
Tomas Mozes noted that a search for "xen efi supermicro masked ExtINT on CPU#1" gives https://lists.xenproject.org/archives/html/xen-devel/2015-12/msg00653.html and a possible work-around.
I tried adding "efi=no-rs" and "reboot=acpi" to the kernel command line and the boot
still hung.  Subsequent responses to the referenced posting suggest using only "reboot=acpi", so I
tried that alone as well.  The result was the same.

Below is a log including JLPDEBUG statements I added to isolate the point of failure.

 Xen 4.11.1
(XEN) Xen version 4.11.1 (@[unknown]) (x86_64-pc-linux-gnu-gcc (Gentoo 7.3.0-r3 p1.4) 7.3.0) debug=n  Mon Mar 11 20:57:43 PDT 2019
(XEN) Latest ChangeSet:
(XEN) Console output is synchronous.
(XEN) Bootloader: GRUB 2.02
(XEN) Command line: placeholder vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=debug xpti=false no-real-mode edd=off efi=no-rs reboot=acpi
...
(XEN) [2019-03-12 05:04:21] HVM: Hardware Assisted Paging (HAP) detected
(XEN) [2019-03-12 05:04:21] HVM: HAP page sizes: 4kB, 2MB
(XEN) [2019-03-12 05:04:17] JLPDEBUG 527 starting setup_local_APIC()
(XEN) [2019-03-12 05:04:17] JLPDEBUG 535 after pounding w/big hammer.
(XEN) [2019-03-12 05:04:17] JLPDEBUG 550 after init_apc_ldr()
(XEN) [2019-03-12 05:04:17] JLPDEBUG 555 starting after apic_write
(XEN) [2019-03-12 05:04:17] JLPDEBUG 574 after for loop
(XEN) [2019-03-12 05:04:17] JLPDEBUG 627 after apic_write.
(XEN) [2019-03-12 05:04:17] masked ExtINT on CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 649 after apic_write  CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 658 after if  CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 662 after apic_write  CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 673 after apic_write  CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 680 after apic_write  CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 700 after apic_pm_activate()  CPU#1
(XEN) [2019-03-12 05:04:17] JLPDEBUG 527 starting setup_local_APIC()
(XEN) [2019-03-12 05:04:17] JLPDEBUG 535 after pounding w/big hammer.
(XEN) [2019-03-12 05:04:17] JLPDEBUG 550 after init_apc_ldr()
(XEN) [2019-03-12 05:04:17] JLPDEBUG 555 starting after apic_write
(XEN) [2019-03-12 05:04:17] JLPDEBUG 574 after for loop
(XEN) [2019-03-12 05:04:17] JLPDEBUG 627 after apic_write.
(XEN) [2019-03-12 05:04:17] masked ExtINT on CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 649 after apic_write  CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 658 after if  CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 662 after apic_write  CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 673 after apic_write  CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 680 after apic_write  CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 700 after apic_pm_activate()  CPU#2
(XEN) [2019-03-12 05:04:17] JLPDEBUG 527 starting setup_local_APIC()
(XEN) [2019-03-12 05:04:17] JLPDEBUG 535 after pounding w/big hammer.
(XEN) [2019-03-12 05:04:17] JLPDEBUG 550 after init_apc_ldr()
(XEN) [2019-03-12 05:04:17] JLPDEBUG 555 starting after apic_write
(XEN) [2019-03-12 05:04:17] JLPDEBUG 574 after for loop
(XEN) [2019-03-12 05:04:17] JLPDEBUG 627 after apic_write.
(XEN) [2019-03-12 05:04:17] masked ExtINT on CPU#3
(XEN) [2019-03-12 05:04:17] JLPDEBUG 649 after apic_write  CPU#3
(XEN) [2019-03-12 05:04:17] JLPDEBUG 658 after if  CPU#3
(XEN) [2019-03-12 05:04:17] JLPDEBUG 662 after apic_write  CPU#3
(XEN) [2019-03-12 05:04:17] JLPDEBUG 673 after apic_write  CPU#3
(XEN) [2019-03-12 05:04:17] JLPDEBUG 680 after apic_write  CPU#3
(XEN) [2019-03-12 05:04:17] JLPDEBUG 700 after apic_pm_activate()  CPU#3
[HUNG]

================= 2nd try with just "reboot=acpi" ========================

(XEN) Command line: placeholder vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=debug xpti=false no-real-mode edd=off reboot=acpi
(XEN) Xen image load base address: 0

...
(XEN) JLPOOLEDEBUG_smpboot 1155 before connect_bsp_APIC()<2>JLPOOLEDEBUG_smpboot 1157 before setup_local_APIC()JLPDEBUG 527 starting setup_local_APIC()
(XEN) JLPDEBUG 535 after pounding w/big hammer.
(XEN) JLPDEBUG 550 after init_apc_ldr()
(XEN) JLPDEBUG 555 starting after apic_write
(XEN) JLPDEBUG 574 after for loop
(XEN) JLPDEBUG 627 after apic_write.
(XEN) enabled ExtINT on CPU#0
(XEN) JLPDEBUG 649 after apic_write  CPU#0
(XEN) JLPDEBUG 658 after if  CPU#0
(XEN) JLPDEBUG 662 after apic_write  CPU#0
(XEN) JLPDEBUG 673 after apic_write  CPU#0
(XEN) JLPDEBUG 680 after apic_write  CPU#0
(XEN) JLPDEBUG 700 after apic_pm_activate()  CPU#0
(XEN) JLPOOLEDEBUG_smpboot 1159 before setup_io_apic()ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
...
(XEN) [2019-03-12 05:22:23] HVM: VMX enabled
(XEN) [2019-03-12 05:22:23] HVM: Hardware Assisted Paging (HAP) detected
(XEN) [2019-03-12 05:22:23] HVM: HAP page sizes: 4kB, 2MB
[HUNG]
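
In the first trace in this comment, the last JLPDEBUG marker before the hang shows how far setup_local_APIC() got on the last CPU reported. A minimal sketch for extracting that marker from a captured log follows; it assumes the "JLPDEBUG <source line> <message>" layout produced by my debug patch, and the function name is hypothetical.

```python
import re

# Matches the debug statements added by the patch, e.g.
# "JLPDEBUG 700 after apic_pm_activate()  CPU#3"
MARKER = re.compile(r"JLPDEBUG (\d+) (.*)")


def last_marker(lines):
    """Return (source_line, message) of the final JLPDEBUG entry, or None."""
    last = None
    for line in lines:
        match = MARKER.search(line)
        if match:
            last = (int(match.group(1)), match.group(2).strip())
    return last
```

If the result is the marker at source line 700 ("after apic_pm_activate()"), the function ran to its last debug statement on that CPU, which would place the hang somewhere after setup_local_APIC() returns rather than inside it.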
Comment 12 John L. Poole 2019-03-12 14:24:57 UTC
I think I may have found the problem.

In /etc/grub.d/20_linux_xen starting at line 119 is a "sed" insert.

  sed "s/^/$submenu_indentation/" << EOF
	echo	'$(echo "$xmessage" | grub_quote)'
        if [ "\$grub_platform" = "pc" -o "\$grub_platform" = "" ]; then
            xen_rm_opts=
        else
            xen_rm_opts="no-real-mode edd=off"
            echo 'WARNING: JLPOOLE HACK of  /etc/grub.d/20_linux_xen since failed id of pc'
            xen_rm_opts=
        fi
	multiboot	${rel_xen_dirname}/${xen_basename} placeholder ${xen_args} \${xen_rm_opts}
	echo	'$(echo "$lmessage" | grub_quote)'
	module	${rel_dirname}/${basename} placeholder root=${linux_root_device_thisversion} ro ${args}
EOF

The above includes my modification of the "else" clause.  Early on in addressing this bug I noticed that the test in the "if" clause resulted in my grub_platform NOT being "pc".  So the else clause was triggered and xen_rm_opts was being populated with two kernel parameters that were then included in my Xen kernel launch.  When the trial of the two additional parameters "efi=no-rs" and "reboot=acpi" did not affect anything, I later remembered that the entire kernel line invocation was tainted with the two values populated in xen_rm_opts.  So I went back and hacked /etc/grub.d/20_linux_xen to mimic the "then" clause.  So far I have had 3 successful boot-ups in a row just by selecting the:

*Gentoo GNU/Linux, with Xen hypervisor

menu option in grub without further editing/modification.

Therefore the absence of "no-real-mode edd=off" and the addition of "efi=no-rs reboot=acpi" seems to be working.  I'll try 3 more boot-ups to verify.

All of this is premised upon the fact that the test ("Grub Platform Test"):

"\$grub_platform" = "pc" -o "\$grub_platform" = ""

should evaluate to true, which it does not on my system (UEFI).  Should grub_platform evaluate to "pc" or "" on an Intel Atom based processor with UEFI?
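
To make the Grub Platform Test easier to reason about, here is the unmodified conditional from the heredoc restated as a small function. This is a sketch of the logic only, not code from GRUB. The value "efi" used in the note below is an assumption based on the GRUB manual's description of grub_platform; I have not printed the variable on this machine.

```python
def xen_rm_opts(grub_platform):
    """Mirror the stock test in 20_linux_xen: BIOS ("pc") or an unset
    platform gets no extra options; anything else gets the two
    real-mode related options appended to the Xen command line."""
    if grub_platform == "pc" or grub_platform == "":
        return ""
    return "no-real-mode edd=off"
```

Under that assumption, `xen_rm_opts("efi")` returns `"no-real-mode edd=off"`, which matches what appeared on my Xen command line before I changed the else clause.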
Comment 13 John L. Poole 2019-03-12 14:34:17 UTC
Alas, I successfully rebooted, then reset after getting past the posting of the ExtINT items, and the second time the Xen kernel hung.

I then unplugged the unit and started afresh, and the Xen kernel hung.  Here's my latest log:

 Xen 4.11.1
(XEN) Xen version 4.11.1 (@[unknown]) (x86_64-pc-linux-gnu-gcc (Gentoo 7.3.0-r3 p1.4) 7.3.0) debug=n  Mon Mar 11 20:57:43 PDT 2019
(XEN) Latest ChangeSet:
(XEN) Console output is synchronous.
(XEN) Bootloader: GRUB 2.02
(XEN) Command line: placeholder vga=gfx-1024x768x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 log_buf_len=16M loglvl=all guest_loglvl=all sync_console=true sched_debug iommu=verbose apic_verbosity=verbose efi=no-rs reboot=acpi
(XEN) Xen image load base address: 0
...
(XEN) [2019-03-12 14:30:56] HVM: VMX enabled
(XEN) [2019-03-12 14:30:56] HVM: Hardware Assisted Paging (HAP) detected
(XEN) [2019-03-12 14:30:56] HVM: HAP page sizes: 4kB, 2MB
[HUNG]
Comment 14 John L. Poole 2019-03-13 01:41:35 UTC
Created attachment 568922 [details]
Crib Notes For Making/Deploying Patch

Here are crib notes for creating the patch I made to add debug statements for xen-4.11.1-rc1
Comment 15 John L. Poole 2019-03-13 01:43:52 UTC
Created attachment 568924 [details, diff]
Debug Patch with JLPOOLEDEBUG

Here's a current patch as of March 12, 2019. I'm awaiting further word from the Xen Mailing list as to what other files the failure point could be in after completing apic.c's function.  See https://lists.xenproject.org/archives/html/xen-users/2019-03/msg00026.html
Comment 16 John L. Poole 2019-03-13 02:07:09 UTC
Created attachment 568926 [details]
output of dmidecode  (637 lines)
Comment 17 John L. Poole 2019-03-13 02:07:59 UTC
Created attachment 568928 [details]
lspci -vvv output (513 lines)
Comment 18 John L. Poole 2019-03-13 02:21:49 UTC
Created attachment 568930 [details]
Xen Console All Diagnostics [ key '*' (ascii '2a') => print all diagnostics]

I successfully booted and then, in my serial console, pressed Control-A three times to enter the Xen console.  I then pressed "h" for help and later "*" for a complete "all" diagnostics output.  Moments after the all diagnostics finished (line "...done"), the server rebooted of its own accord through the watchdog:


(XEN) [2019-03-13 02:14:20] .................................... done.
(XEN) [2019-03-13 02:14:39] Watchdog timer fired for domain 0
(XEN) [2019-03-13 02:14:39] Hardware Dom0 shutdown: watchdog rebooting machine
Comment 19 Jeroen Roovers (RETIRED) gentoo-dev 2019-03-15 20:30:58 UTC
*** Bug 680472 has been marked as a duplicate of this bug. ***
Comment 20 John L. Poole 2019-03-15 21:03:55 UTC
I've logged Bug #680472 which arises from kernel 4.12.0.  

This bug, Bug #679826, arises from kernel 4.11.1.

Each bug relates to a specific kernel and while 4.12.0 was marked RESOLVED, that status does not accurately depict the status of kernel 4.12.0.  Since I'm running on 4.12.0 now, I'll be updating #680472 so as to keep the outputs derived from each kernel isolated.
Comment 21 Tomáš Mózes 2019-03-15 21:25:56 UTC
Have you also tested linux kernel 4.14?
Comment 22 John L. Poole 2019-03-15 21:28:33 UTC
(In reply to Tomáš Mózes from comment #21)
> Have you also tested linux kernel 4.14?

I have not.  I can do so if you indicate it will be helpful.  I'd just take the 4.12.0_rc4 ebuilds and change the versions and cross my fingers.  I would also like to know if I should then remain on 4.14, or revert back to 4.12.0 or whatever.
Comment 23 Tomáš Mózes 2019-03-15 21:32:17 UTC
Do you pass all the dom0 kernel requirements? 

https://wiki.xenproject.org/wiki/Mainline_Linux_Kernel_Configs
Comment 24 Tomáš Mózes 2019-03-15 21:34:59 UTC
(In reply to John L. Poole from comment #22)
> (In reply to Tomáš Mózes from comment #21)
> > Have you also tested linux kernel 4.14?
> 
> I have not.  I can do so if you indicate it will be helpful.  I'd just take
> the 4.12.0_rc4 ebuilds and change the versions and cross my fingers.  I
> would also like to know if I should then remain on 4.14, or revert back to
> 4.12.0 or whatever.

I mean the linux kernel, not xen.
Comment 25 Tomáš Mózes 2019-03-15 21:38:31 UTC
I don't use kernel 4.19 yet, but have a few machines on 4.14 lts.

Please note the difference between the xen versions (4.10, 4.11, 4.12) and the linux kernel versions (4.14, 4.19, 4.20, 5.0).
Comment 26 John L. Poole 2019-03-15 21:44:51 UTC
This expert analysis just came in from a Xen Developer Andrew Cooper at Citrix:

conclusion: ...the root of your problem is that Xen can't find the ACPI tables, which is either going to be a grub or a Xen build misconfiguration.

Further discussion:

So the first problem is that there aren't any APCI tables to be found.  You're presumably booting in EFI mode, but either Grub hasn't handed the SystemTable/etc to Xen, or Xen wasn't built with an EFI-capable toolchain and isn't capable of receiving them via the extended multiboot2 protocol.  One way or another, this is the root of the problem.

Complete email at:  https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg01279.html

I'm going to hold off pursuing kernel changes as proposed in the last hour and consider Mr. Cooper's analysis and see what I can make of it.
Comment 27 John L. Poole 2019-03-16 00:31:06 UTC
For posterity, here is the version of grub I've been working with:
zeta /home/jlpoole # eix grub -I
[I] sys-boot/grub
     Available versions:  (2) 2.02-r1(2/2.02-r1)^st ~2.02-r2(2/2.02-r2)^st ~2.02-r3(2/2.02-r3)^st **9999(2/9999)^st
       {debug device-mapper doc efiemu +fonts libzfs mount multislot nls sdl static test +themes truetype GRUB_PLATFORMS="coreboot efi-32 efi-64 emu ieee1275 loongson multiboot pc qemu qemu-mips uboot xen xen-32"}
     Installed versions:  2.02-r1(2/2.02-r1)^st(11:16:22 PM 03/02/2019)(fonts nls themes -debug -device-mapper -doc -efiemu -libzfs -mount -multislot -sdl -static -test -truetype GRUB_PLATFORMS="efi-64 pc xen -coreboot -efi-32 -emu -ieee1275 -loongson -multiboot -qemu -qemu-mips -uboot -xen-32")
     Homepage:            https://www.gnu.org/software/grub/
     Description:         GNU GRUB boot loader

zeta /home/jlpoole # 

Since Andrew Cooper has indicated there may be a grub issue afoot plus the fact that two years ago grub was not ready for UEFI xen support, I'm going to focus on grub for the moment.  It seems to me I probably should have "multiboot", though the problem is happening solely within the early stages of the Xen kernel.  Could someone from the Gentoo Xen team indicate their use flags for grub?
Comment 28 Tomáš Mózes 2019-03-16 01:03:44 UTC
I have a single uefi system:

sys-boot/grub-2.02-r3::gentoo was built with the following:
USE="-debug -device-mapper -doc -efiemu -fonts -libzfs -mount -multislot -nls -sdl -static (-test) -themes -truetype" ABI_X86="(64)" GRUB_PLATFORMS="efi-64 -coreboot -efi-32 -emu -ieee1275 -loongson -multiboot -pc -qemu -qemu-mips -uboot -xen -xen-32"
CFLAGS=""
LDFLAGS=""

app-emulation/xen-4.10.3-r1::gentoo was built with the following:
USE="efi -custom-cflags -debug -flask" ABI_X86="(64)"
CFLAGS=""
LDFLAGS=""

app-emulation/xen-tools-4.10.3-r1::gentoo was built with the following:
USE="hvm pam qemu qemu-traditional screen -api -custom-cflags -debug -doc -flask -ocaml -ovmf -pygrub -python -sdl -static-libs -system-qemu -system-seabios" ABI_X86="(64)" PYTHON_TARGETS="python2_7"
CFLAGS="-fno-strict-overflow"
CXXFLAGS="-mtune=native -O2 -pipe -fno-strict-overflow"
LDFLAGS=""
Comment 29 John L. Poole 2019-03-16 02:05:31 UTC
Upgraded grub:
zeta /home/jlpoole # eix -I grub
[I] sys-boot/grub
     Available versions:  (2) 2.02-r1(2/2.02-r1)^st ~2.02-r2(2/2.02-r2)^st (~)2.02-r3(2/2.02-r3)^st **9999(2/9999)^st
       {debug device-mapper doc efiemu +fonts libzfs mount multislot nls sdl static test +themes truetype GRUB_PLATFORMS="coreboot efi-32 efi-64 emu ieee1275 loongson multiboot pc qemu qemu-mips uboot xen xen-32"}
     Installed versions:  2.02-r3(2/2.02-r3)^st(06:47:25 PM 03/15/2019)(fonts nls themes -debug -device-mapper -doc -efiemu -libzfs -mount -multislot -sdl -static -test -truetype GRUB_PLATFORMS="efi-64 pc xen -coreboot -efi-32 -emu -ieee1275 -loongson -multiboot -qemu -qemu-mips -uboot -xen-32")
     Homepage:            https://www.gnu.org/software/grub/
     Description:         GNU GRUB boot loader

zeta /home/jlpoole #

Tried rebooting twice.  Same problem.  

To clarify, I think Andrew Cooper meant "ACPI", not "APCI" when he wrote "aren't any APCI tables".  There seems to be some recent threads re: ACPI on the Xen List, so I'm going to research those.  In https://lists.xenproject.org/archives/html/xen-devel/2018-03/msg00524.html Andrew Cooper suggests "Upgrade Grub to 2.02".  My previous grub was 2.02-r1, now I am at r3.
Comment 30 Tomáš Mózes 2019-03-16 02:36:24 UTC
(In reply to John L. Poole from comment #29)
> In https://lists.xenproject.org/archives/html/xen-devel/2018-03/msg00524.html
> Andrew Cooper suggests "Upgrade Grub to 2.02".  My previous grub was
> 2.02-r1, now I am at r3.

I remember installing a new HP DL360 Gen9 server 2 years ago with efi and it only worked while booting a normal kernel, however under Xen only 1 cpu was reported. That's why i switched back to legacy boot and it worked fine.
Comment 31 John L. Poole 2019-03-16 03:22:16 UTC
OK, the easier path now is to explore whether my hardware will let me run in non-UEFI mode.  I'm certain I went through this exercise before, but I did not document it.
Comment 32 John L. Poole 2019-03-16 03:52:06 UTC
I was unable to find any setting in the BIOS menu that allows me to be in a mode other than UEFI.

Moreover, I consulted the Supermicro manual (https://www.supermicro.com/manuals/motherboard/Atom_on-chip/MNL-1568.pdf) for the "B1SA4-2750F / B1SA4-2550F" and there is no mention of switching BIOS modes or of a legacy BIOS.  In fact, they state "The B1SA4-2750F/B1SA4-2550F Motherboard is a micro cloud motherboard optimized for the Supermicro Microblade chassis."  Ironically, they also state: "This product is intended to be installed and serviced by professional technicians."  I'm resigned to the fact that I am not a professional.  :(
Comment 33 John L. Poole 2019-03-16 13:10:32 UTC
Progress:  I have resorted to launching the Xen kernel from an EFI console. To get to the EFI console, I've been letting grub load and then "c" for command line, and "exit" to exit grub which drops me into an EFI console session.  One can go directly to an EFI console session before entering grub.

Here are some highlights of what I have learned:
1) the serial console set-up I have (using Windows PuTTY 7.7.0.40 [2019] -- yes, you have to "contribute" to obtain a recent version) seems to be introducing invisible non-ASCII characters, e.g. \177, into the text.  These characters could have been coming from my Windows session of Notepad++ or Emacs from a regular Gentoo kernel session, probably the former.  So I ended up using the keyboard connected to the USB port on the Xen server to assure no introduction of extraneous invisible characters.  Also, I do have a USB extended cable and the cable may be causing some problems.

2) This is really quirky. I kept getting the error message below when executing a command, i.e. xen-4.12.0-rc.efi -cfg=xen.cfg, that I thought had worked years ago.  I'm certain I had configuration files like "jp.cfg", but no matter what I tried, I kept getting "No configuration file found" from the just-launched xen-4.12.0-rc.efi.  The error message may have been caused by non-ASCII characters in the command line, which I later discovered in an editing session using nano.


fs0:\efi\gentoo> xen-4.12.0-rc.efi -cfg=xen.cfg
Xen 4.12.0-rc (c/s ) EFI loader
No configuration file found.

At any rate, I finally edited a file named xen.cfg in the EFI console (for EFI commands, see https://software.intel.com/en-us/articles/efi-shells-and-scripting/) and launched "xen-4.12.0-rc.efi" by itself, with no "-cfg=..." specification, letting its built-in search for a configuration file in the same directory do its job.  That finally got me past "No configuration file found."  Heed this: "To illustrate the name handling, a binary named xen-4.2-unstable.efi would try xen-4.2-unstable.cfg, xen-4.2.cfg, xen-4.cfg, and xen.cfg in order." from http://xenbits.xenproject.org/docs/unstable/misc/efi.html



A successful load of the kernel has this output on the serial console, and then no more:
Xen 4.12.0-rc (c/s ) EFI loader
Using configuration file 'xen.cfg'
xen-4.12.0-rc.gz: 0x000000005ad2b000-0x000000005ae49573
0x0000:0x02:0x00.0x0: ROM: 0x8000 bytes at 0x7c8bc028


Everything else, e.g. the (XEN) postings, goes only to the console attached to the Xen server, which cannot be routed to a log file.  Note, I do not have this problem when launching from grub, so the serial console settings within xen.cfg are not set correctly and I'll have to sort that out.

3) I get the same inconsistent stopping points when loading the Xen kernel from EFI.  Since I cannot capture the output as it scrolls by, I have only the final postings to compare.  But I'm guessing the output included the warning Andrew Cooper noted, which indicated that the kernel cannot find the correct table.  This is important because it demonstrates that the Xen kernel I have built is what has the problem, since grub is not part of the equation.
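
The configuration-file name handling quoted in item 2) can be sketched in a few lines of Python.  This is only an illustration of the documented search order, not Xen's actual loader code; the function name candidate_cfg_names and the choice of '.', '-' and '_' as separators are assumptions made for the example.

```python
import re


def candidate_cfg_names(binary_name):
    """List the .cfg names to try for an EFI binary, most specific first.

    Assumption: the stem is shortened one trailing component at a time,
    where components are separated by '.', '-' or '_'.
    """
    # Drop the ".efi" extension if it is present.
    if binary_name.lower().endswith(".efi"):
        stem = binary_name[:-4]
    else:
        stem = binary_name

    names = []
    while stem:
        names.append(stem + ".cfg")
        # Find the last separator together with the component after it.
        match = re.search(r"[.\-_][^.\-_]*$", stem)
        if match is None:
            break  # nothing left to strip
        stem = stem[:match.start()]
    return names


if __name__ == "__main__":
    for name in candidate_cfg_names("xen-4.2-unstable.efi"):
        print(name)
```

Under these assumptions, xen-4.12.0-rc.efi would look for xen-4.12.0-rc.cfg, xen-4.12.0.cfg, xen-4.12.cfg, xen-4.cfg and finally xen.cfg, which is consistent with the bare launch picking up xen.cfg.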

I'm now going to review the Xen kernel configuration and the logs of the xen-tools emerge, which I believe builds the kernel (because the app-emulation/xen-tools logs are so large and the app-emulation/xen log is only a few lines).
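
As an aside to item 1) above: stray invisible bytes such as \177 (DEL) in a configuration file or a pasted command line can be located before rebooting.  A minimal Python sketch follows, assuming the file is small enough to read in one go; the helper name find_suspect_bytes is made up for this example and the sample text is illustrative.

```python
def find_suspect_bytes(data: bytes):
    """Return (offset, byte value) pairs for bytes that are not printable
    ASCII, tab, line feed or carriage return."""
    allowed_controls = {0x09, 0x0A, 0x0D}  # tab, LF, CR
    suspects = []
    for offset, value in enumerate(data):
        printable = 0x20 <= value <= 0x7E
        if not printable and value not in allowed_controls:
            suspects.append((offset, value))
    return suspects


if __name__ == "__main__":
    # Illustrative sample with a DEL byte (octal 177) after "abc".
    sample = b"[global]\ndefault=abc\x7f\n"
    for offset, value in find_suspect_bytes(sample):
        print(f"offset {offset}: byte 0o{value:03o} (0x{value:02x})")
```

To check a real file, the bytes can be read with open("/boot/efi/gentoo/xen.cfg", "rb").read() and passed to the same function.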
Comment 34 John L. Poole 2019-03-17 05:28:19 UTC
Created attachment 569408 [details]
Boot Log (Unsuccessful) from EFI Console

The error message Andrew Cooper identified from the grub2 boot attempt:

   (XEN) ACPI Error (tbxfroot-0217): A valid RSDP was not found [20070126]

is generated by xen / drivers / acpi / tables / tbxfroot.c at line 217.

See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/drivers/acpi/tables/tbxfroot.c;h=18e5ad6e5a18804d80434354425dd0b7bb224e76;hb=HEAD#l217

ACPI is the Advanced Configuration and Power Interface (see Wikipedia), a power management and configuration standard for the PC developed by Intel, Microsoft and Toshiba.  https://wiki.osdev.org/ACPI
To begin using ACPI, the operating system must look for the RSDP (Root System Description Pointer).  See https://wiki.osdev.org/RSDP, especially https://wiki.osdev.org/RSDP#Detecting_the_RSDP

Line 217 is the line before the final return of tbxfroot's function acpi_tb_scan_memory_for_rsdp(u8 * start_address, u32 length).  After 2 attempts to locate the root ACPI table (RSDT) the function prints this warning/error message.
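
To make the RSDP search concrete, here is a small Python sketch of the idea described on the OSDev page: walk a memory region in 16-byte steps, look for the 8-byte signature "RSD PTR " and accept a candidate only if its first 20 bytes sum to zero modulo 256.  It operates on an ordinary bytes buffer rather than real firmware memory, it is not the code in tbxfroot.c, and the names find_rsdp and make_fake_rsdp as well as the sample address are invented for the illustration.

```python
import struct

RSDP_SIGNATURE = b"RSD PTR "  # 8 bytes; note the trailing space
RSDP_V1_LENGTH = 20           # size of the ACPI 1.0 part covered by the checksum


def find_rsdp(memory: bytes):
    """Return the offset of the first valid RSDP in the buffer, or None."""
    # The structure is expected on a 16-byte boundary.
    for offset in range(0, len(memory) - RSDP_V1_LENGTH + 1, 16):
        candidate = memory[offset:offset + RSDP_V1_LENGTH]
        if candidate[:8] != RSDP_SIGNATURE:
            continue
        # Valid only if all 20 bytes add up to 0 modulo 256.
        if sum(candidate) % 256 == 0:
            return offset
    return None


def make_fake_rsdp(rsdt_address=0x7FE00000):
    """Build a 20-byte test structure (hypothetical values) with a good checksum."""
    body = bytearray(
        RSDP_SIGNATURE
        + b"\x00"                          # checksum placeholder
        + b"DEMO  "                        # 6-byte OEM id
        + b"\x00"                          # revision 0 = ACPI 1.0
        + struct.pack("<I", rsdt_address)  # 32-bit RSDT address
    )
    body[8] = (-sum(body)) % 256           # make the byte sum wrap to zero
    return bytes(body)


if __name__ == "__main__":
    region = bytes(32) + make_fake_rsdp() + bytes(44)
    print(find_rsdp(region))
```

A region with no signature, or with a signature whose checksum does not add up, yields None, which corresponds to the "A valid RSDP was not found" situation.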

In a previous posting, I concluded that booting from EFI, without grub2, caused the same problem and that therefore the problem must be in the Xen kernel.  Unfortunately, at that time, when I booted using the EFI command line, all the print-outs went only to the console attached to the server and my serial port session remained blank after the launch of the Xen kernel.  I therefore could not see whether the above message "A valid RSDP was not found" appeared, because the output had flown off the screen, but I found my system hanging at the same location.  I erred in jumping to that conclusion and, suspecting so, I have worked on getting the Xen kernel output (booted from the EFI command line) to my serial console so I can capture it in a log in my Windows PuTTY session.  I have achieved that now.  The EFI boot session log which I have attached does *NOT* contain "A valid RSDP was not found".  I realized I had copied the Gentoo kernel parameters to the EFI configuration one-to-one.  It turns out the parameters for the Xen kernel are different from the ones for the Gentoo kernel.  I therefore looked at each parameter, checked it against the documented Xen ones published at https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html#apic-x86 and added an options line before the Gentoo kernel line.

The result is I have on my boot directory the following:

    zeta /home/jlpoole # ls -la /boot/efi/gentoo
    total 12778
    drwxr-xr-x 3 root root    6144 Mar 17  2019  .
    drwxr-xr-x 5 root root     512 Mar 17  2019  ..
    drwxr-xr-x 2 root root    1024 Mar 15 21:05  attic
    -rwxr-xr-x 1 root root 8919936 Mar  6 08:05  initramfs-genkernel-x86_64-4.19.23-gentoo
    -rwxr-xr-x 1 root root     368 Mar 15 21:04  jp.conf
    -rwxr-xr-x 1 root root     368 Mar 15 21:05 '#jp.config#'
    -rwxr-xr-x 1 root root     368 Mar 15 21:04  jp.config
    -rwxr-xr-x 1 root root 2980978 Mar 15 21:02  xen-4.12.0-rc.efi
    -rwxr-xr-x 1 root root 1172851 Mar 15 21:03  xen-4.12.0-rc.gz
    -rwxr-xr-x 1 root root     354 Mar 17  2019  xen.cfg
    -rwxr-xr-x 1 root root     368 Mar 16 05:27  xen.cfg.WORKS
    zeta /home/jlpoole #

Note: xen-4.12.0-rc.efi was placed by app-emulation/xen under /usr/lib64/efi:

    zeta /home/jlpoole # ls -la /usr/lib64/efi
    total 2956
    drwxr-xr-x  2 root root    4096 Mar 14 20:58 .
    drwxr-xr-x 44 root root   36864 Mar 14 20:55 ..
    -rw-r--r--  1 root root 2980978 Mar 14 20:58 xen-4.12.0-rc.efi
    lrwxrwxrwx  1 root root      17 Mar 14 20:58 xen-4.12.efi -> xen-4.12.0-rc.efi
    lrwxrwxrwx  1 root root      17 Mar 14 20:58 xen-4.efi -> xen-4.12.0-rc.efi
    lrwxrwxrwx  1 root root      17 Mar 14 20:58 xen.efi -> xen-4.12.0-rc.efi
    zeta /home/jlpoole #


The xen.cfg file (recall the program is looking for this particularly named
file "xen.cfg" and the attempt to use -cfg=my.cfg failed with "not found")
has this in its contents:

    zeta /home/jlpoole # cat -n /boot/efi/gentoo/xen.cfg
         1  [global]
         2  default=abc
         3
         4  [abc]
         5  options=console=vga,com1 com1=115200,8n1
         6  kernel=xen-4.12.0-rc.gz  root=/dev/sda4  vga=gfx-1024x768x16  com1=115200,8n1 console=com1 console_timestamps=date console_to_ring conring_size=16k  loglvl=all guest_loglvl=all sync_console=true iommu=debug apic_verbosity=debug
         7  initramfs=initramfs-genkernel-x86_64-4.19.23-gentoo
         8
         9
    zeta /home/jlpoole #
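
The layout of this file, a [global] section whose default= entry names the section to boot, can be illustrated with Python's configparser.  This is only a reading aid under the assumption that the file is plain INI-style text; it is not how the Xen EFI loader parses it, and the helper name resolve_default_section is made up.  The sample text is abbreviated from the listing above.

```python
import configparser

# Abbreviated copy of the xen.cfg shown above (kernel line shortened).
SAMPLE_CFG = """\
[global]
default=abc

[abc]
options=console=vga,com1 com1=115200,8n1
kernel=xen-4.12.0-rc.gz  root=/dev/sda4
initramfs=initramfs-genkernel-x86_64-4.19.23-gentoo
"""


def resolve_default_section(text):
    """Return (section name, entries) for the section named by [global] default=."""
    # Only '=' separates key and value; later '=' signs stay in the value.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    name = parser["global"]["default"]
    return name, dict(parser[name])


if __name__ == "__main__":
    name, entries = resolve_default_section(SAMPLE_CFG)
    print("default section:", name)
    for key, value in entries.items():
        print(f"{key} -> {value}")
```

Printing the entries this way makes it easy to see which text ends up on the options line and which on the kernel line.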

The point of all of the above is that under the EFI boot attempt, where the
Xen kernel hangs at the ExtINT points, the EFI boot log does not contain the
error message "RSDP was not found".  Thus, different boot messages are being posted depending upon whether the boot goes through grub2 or EFI.  I am also seeing ACPI output in my regular Gentoo kernel boot log (that kernel boots successfully), so I can contrast the successful Gentoo kernel output with the sporadic grub2 and EFI boot attempts and perhaps home in on where locating the Root System Description Pointer goes wrong.

Vogue la galère!
Comment 35 John L. Poole 2019-03-21 17:13:24 UTC
I tried building from the Xen source once I saw the Gentoo build applied patches.  I run into the same problems, but without the error Andrew Cooper focused on, and have contacted the XEN-DEVEL list at
https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg01691.html
Comment 36 Tomáš Mózes 2019-04-03 17:15:54 UTC
Any progress on this?
Comment 37 John L. Poole 2019-04-03 17:30:22 UTC
I've reached a point where Jan Beulich is "out of ideas for the moment."  https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg01976.html

I've compared Gentoo's Linux apic.c with the Xen Project's and there are many differences.  For instance, there is a preprocessor "if" directive with a hard-coded value of 1 at line 605, and I do not understand why this hard override was implemented.  I have to build a patch that monitors each step so I can see whether the hang occurs at a point where the two code bases diverge.

I have two courses to pursue 1) download an old Gentoo ebuild around version 7 which did load two years ago (through an EFI console) and then compare the apic.c and calling codes to see what has changed.   2) debug the existing and Gentoo's with print statements and contrast.

Both of these endeavors are going to take at least 4 hours and I have not had the time and energy to undertake this at this time.
Comment 38 John L. Poole 2019-09-26 00:27:28 UTC
Short version: Windows wireless USB keyboard hardware incompatibility caused the problem. 

The Take-away: a USB keyboard can affect the boot for the xen kernel

This was a hardware caused problem.

Long version:

I had several critical matters that I could not postpone, so my work
on this has been suspended since May.  I finally had time to resume work on this problem.  Recall,
I could successfully boot a Gentoo kernel, but when I tried a Xen kernel, the system would
hang early on at the masking of the CPUs.

By chance, I decided to swap out the USB keyboard "Microsoft Wireless Desktop Receiver 3.1" model: 1028,
because I had to keep replacing batteries and the range was very limited, e.g. 15",
and characters were dropping out.  I replaced it with a generic Amazon USB keyboard. 
Suddenly the boot problems went away: no more hanging at the CPU masking point.

I sailed through and successfully booted.  Moreover, I had placed a new hard disk
in the server, disengaged the existing one, and installed the
Debian version, 8.6.0 of 11/8/2016, that I first used to test this server, so I had an apples-to-apples
test case before I returned this for service under warranty.  The installation, while it was occurring,
had video artifacts that prohibited the graphic install and dropped me into a console install
with colorations that caused invisible selections.  After I installed Debian 8.6.0, I had the
same problem -- I could not get past the "masked ExtINT on CPU#...".

Since this discovery several days ago, I have booted my various xen kernels (in EFI) and have
not encountered any of the problems I previously suffered.  While I do have some other issues
that relate to Gentoo specific tweaks, I am not concerned and I wanted to close this issue
by reporting this discovery.  Of course, I can make available the USB unit to qualified persons if
they want to test or I can affix it to the server to test a debugging version.
Comment 39 Tomáš Mózes 2019-09-26 04:35:55 UTC
Oh, haven't thought about this possibility, although it's true I had this issue in the past, but not only with xen.

Maybe try posting your findings to the xen mailing list so they'll decide whether to continue with the investigation. It would be best to continue your previous thread where Jan Beulich was "out of ideas".
Comment 40 John L. Poole 2019-09-26 04:40:03 UTC
I did post to the mailing list at the same time with the same text.

It never occurred to me that a USB product such as a keyboard could do anything to interfere with the low-level set-up of processors.  This kind of failure ought to be publicized, and the first question out of the box to people having problems at the very initial start-up should be: what hardware do you have attached to the USB ports?  It can affect the kernel start-up.

A very very expensive lesson.  This cost me about 7 months and is a story I'll tell other people's grandchildren.