================================================================================
7. Capabilities that can be enabled on VMs
================================================================================

There are certain capabilities that change the behavior of the virtual
machine when enabled. To enable them, please see section 4.37. Below
you can find a list of capabilities and what their effect on the VM
is when enabling them.

The following information is provided along with the description:

  Architectures: which instruction set architectures provide this ioctl.
      x86 includes both i386 and x86_64.

  Parameters: what parameters are accepted by the capability.

  Returns: the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
      are not detailed, but errors with specific meanings are.

--------------------------------------------------------------------------------
7.1 KVM_CAP_PPC_ENABLE_HCALL
--------------------------------------------------------------------------------

Architectures: ppc
Parameters: args[0] is the sPAPR hcall number
	    args[1] is 0 to disable, 1 to enable in-kernel handling

This capability controls whether individual sPAPR hypercalls (hcalls)
get handled by the kernel or not.  Enabling or disabling in-kernel
handling of an hcall is effective across the VM.  On creation, an
initial set of hcalls are enabled for in-kernel handling, which
consists of those hcalls for which in-kernel handlers were implemented
before this capability was implemented.  If disabled, the kernel will
not to attempt to handle the hcall, but will always exit to userspace
to handle it.  Note that it may not make sense to enable some and
disable others of a group of related hcalls, but KVM does not prevent
userspace from doing that.

If the hcall number specified is not one that has an in-kernel
implementation, the KVM_ENABLE_CAP ioctl will fail with an EINVAL
error.

--------------------------------------------------------------------------------
7.2 KVM_CAP_S390_USER_SIGP
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none

This capability controls which SIGP orders will be handled completely in user
space. With this capability enabled, all fast orders will be handled completely
in the kernel:
- SENSE
- SENSE RUNNING
- EXTERNAL CALL
- EMERGENCY SIGNAL
- CONDITIONAL EMERGENCY SIGNAL

All other orders will be handled completely in user space.

Only privileged operation exceptions will be checked for in the kernel (or even
in the hardware prior to interception). If this capability is not enabled, the
old way of handling SIGP orders is used (partially in kernel and user space).

--------------------------------------------------------------------------------
7.3 KVM_CAP_S390_VECTOR_REGISTERS
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none
Returns: 0 on success, negative value on error

Allows use of the vector registers introduced with z13 processor, and
provides for the synchronization between host and user space.  Will
return -EINVAL if the machine does not support vectors.

--------------------------------------------------------------------------------
7.4 KVM_CAP_S390_USER_STSI
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none

This capability allows post-handlers for the STSI instruction. After
initial handling in the kernel, KVM exits to user space with
KVM_EXIT_S390_STSI to allow user space to insert further data.

Before exiting to userspace, kvm handlers should fill in s390_stsi field of
vcpu->run:

.. code-block:: c

   struct {
           __u64 addr;
           __u8 ar;
           __u8 reserved;
           __u8 fc;
           __u8 sel1;
           __u16 sel2;
   } s390_stsi;

@addr - guest address of STSI SYSIB
@fc   - function code
@sel1 - selector 1
@sel2 - selector 2
@ar   - access register number

KVM handlers should exit to userspace with rc = -EREMOTE.

--------------------------------------------------------------------------------
7.5 KVM_CAP_SPLIT_IRQCHIP
--------------------------------------------------------------------------------

Architectures: x86
Parameters: args[0] - number of routes reserved for userspace IOAPICs
Returns: 0 on success, -1 on error

Create a local apic for each processor in the kernel. This can be used
instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the
IOAPIC and PIC (and also the PIT, even though this has to be enabled
separately).

This capability also enables in kernel routing of interrupt requests;
when KVM_CAP_SPLIT_IRQCHIP only routes of KVM_IRQ_ROUTING_MSI type are
used in the IRQ routing table.  The first args[0] MSI routes are reserved
for the IOAPIC pins.  Whenever the LAPIC receives an EOI for these routes,
a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace.

Fails if VCPU has already been created, or if the irqchip is already in the
kernel (i.e. KVM_CREATE_IRQCHIP has already been called).

--------------------------------------------------------------------------------
7.6 KVM_CAP_S390_RI
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none

Allows use of runtime-instrumentation introduced with zEC12 processor.
Will return -EINVAL if the machine does not support runtime-instrumentation.
Will return -EBUSY if a VCPU has already been created.

--------------------------------------------------------------------------------
7.7 KVM_CAP_X2APIC_API
--------------------------------------------------------------------------------

Architectures: x86
Parameters: args[0] - features that should be enabled
Returns: 0 on success, -EINVAL when args[0] contains invalid features

Valid feature flags in args[0] are

.. code-block:: c

   #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
   #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)

Enabling KVM_X2APIC_API_USE_32BIT_IDS changes the behavior of
KVM_SET_GSI_ROUTING, KVM_SIGNAL_MSI, KVM_SET_LAPIC, and KVM_GET_LAPIC,
allowing the use of 32-bit APIC IDs.  See KVM_CAP_X2APIC_API in their
respective sections.

KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK must be enabled for x2APIC to work
in logical mode or with more than 255 VCPUs.  Otherwise, KVM treats 0xff
as a broadcast even in x2APIC mode in order to support physical x2APIC
without interrupt remapping.  This is undesirable in logical mode,
where 0xff represents CPUs 0-7 in cluster 0.

--------------------------------------------------------------------------------
7.8 KVM_CAP_S390_USER_INSTR0
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none

With this capability enabled, all illegal instructions 0x0000 (2 bytes) will
be intercepted and forwarded to user space. User space can use this
mechanism e.g. to realize 2-byte software breakpoints. The kernel will
not inject an operating exception for these instructions, user space has
to take care of that.

This capability can be enabled dynamically even if VCPUs were already
created and are running.

--------------------------------------------------------------------------------
7.9 KVM_CAP_S390_GS
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none
Returns: 0 on success; -EINVAL if the machine does not support
	 guarded storage; -EBUSY if a VCPU has already been created.

Allows use of guarded storage for the KVM guest.

--------------------------------------------------------------------------------
7.10 KVM_CAP_S390_AIS
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none

Allow use of adapter-interruption suppression.
Returns: 0 on success; -EBUSY if a VCPU has already been created.

--------------------------------------------------------------------------------
7.11 KVM_CAP_PPC_SMT
--------------------------------------------------------------------------------

Architectures: ppc
Parameters: vsmt_mode, flags

Enabling this capability on a VM provides userspace with a way to set
the desired virtual SMT mode (i.e. the number of virtual CPUs per
virtual core).  The virtual SMT mode, vsmt_mode, must be a power of 2
between 1 and 8.  On POWER8, vsmt_mode must also be no greater than
the number of threads per subcore for the host.  Currently flags must
be 0.  A successful call to enable this capability will result in
vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is
subsequently queried for the VM.  This capability is only supported by
HV KVM, and can only be set before any VCPUs have been created.
The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT
modes are available.

--------------------------------------------------------------------------------
7.12 KVM_CAP_PPC_FWNMI
--------------------------------------------------------------------------------

Architectures: ppc
Parameters: none

With this capability a machine check exception in the guest address
space will cause KVM to exit the guest with NMI exit reason. This
enables QEMU to build error log and branch to guest kernel registered
machine check handling routine. Without this capability KVM will
branch to guests' 0x200 interrupt vector.

--------------------------------------------------------------------------------
7.13 KVM_CAP_X86_DISABLE_EXITS
--------------------------------------------------------------------------------

Architectures: x86
Parameters: args[0] defines which exits are disabled
Returns: 0 on success, -EINVAL when args[0] contains invalid exits

Valid bits in args[0] are

.. code-block:: c

   #define KVM_X86_DISABLE_EXITS_MWAIT            (1 << 0)
   #define KVM_X86_DISABLE_EXITS_HLT              (1 << 1)

Enabling this capability on a VM provides userspace with a way to no
longer intercept some instructions for improved latency in some
workloads, and is suggested when vCPUs are associated to dedicated
physical CPUs.  More bits can be added in the future; userspace can
just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable
all such vmexits.

Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.

--------------------------------------------------------------------------------
7.14 KVM_CAP_S390_HPAGE_1M
--------------------------------------------------------------------------------

Architectures: s390
Parameters: none
Returns: 0 on success, -EINVAL if hpage module parameter was not set
	 or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL
	 flag set

With this capability the KVM support for memory backing with 1m pages
through hugetlbfs can be enabled for a VM. After the capability is
enabled, cmma can't be enabled anymore and pfmfi and the storage key
interpretation are disabled. If cmma has already been enabled or the
hpage module parameter is not set to 1, -EINVAL is returned.

While it is generally possible to create a huge page backed VM without
this capability, the VM will not be able to run.

--------------------------------------------------------------------------------
7.15 KVM_CAP_MSR_PLATFORM_INFO
--------------------------------------------------------------------------------

Architectures: x86
Parameters: args[0] whether feature should be enabled or not

With this capability, a guest may read the MSR_PLATFORM_INFO MSR. Otherwise,
a #GP would be raised when the guest tries to access. Currently, this
capability does not enable write permissions of this MSR for the guest.

--------------------------------------------------------------------------------
7.16 KVM_CAP_PPC_NESTED_HV
--------------------------------------------------------------------------------

Architectures: ppc
Parameters: none
Returns: 0 on success, -EINVAL when the implementation doesn't support
	 nested-HV virtualization.

HV-KVM on POWER9 and later systems allows for "nested-HV"
virtualization, which provides a way for a guest VM to run guests that
can run using the CPU's supervisor mode (privileged non-hypervisor
state).  Enabling this capability on a VM depends on the CPU having
the necessary functionality and on the facility being enabled with a
kvm-hv module parameter.

--------------------------------------------------------------------------------
7.17 KVM_CAP_EXCEPTION_PAYLOAD
--------------------------------------------------------------------------------

Architectures: x86
Parameters: args[0] whether feature should be enabled or not

With this capability enabled, CR2 will not be modified prior to the
emulated VM-exit when L1 intercepts a #PF exception that occurs in
L2. Similarly, for kvm-intel only, DR6 will not be modified prior to
the emulated VM-exit when L1 intercepts a #DB exception that occurs in
L2. As a result, when KVM_GET_VCPU_EVENTS reports a pending #PF (or
#DB) exception for L2, exception.has_payload will be set and the
faulting address (or the new DR6 bits*) will be reported in the
exception_payload field. Similarly, when userspace injects a #PF (or
#DB) into L2 using KVM_SET_VCPU_EVENTS, it is expected to set
exception.has_payload and to put the faulting address (or the new DR6
bits*) in the exception_payload field.

This capability also enables exception.pending in struct
kvm_vcpu_events, which allows userspace to distinguish between pending
and injected exceptions.

* For the new DR6 bits, note that bit 16 is set iff the #DB exception
  will clear DR6.RTM.

--------------------------------------------------------------------------------
7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
--------------------------------------------------------------------------------

Architectures: all
Parameters: args[0] whether feature should be enabled or not

With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
clear and write-protect all pages that are returned as dirty.
Rather, userspace will have to do this operation separately using
KVM_CLEAR_DIRTY_LOG.

At the cost of a slightly more complicated operation, this provides better
scalability and responsiveness for two reasons.  First,
KVM_CLEAR_DIRTY_LOG ioctl can operate on a 64-page granularity rather
than requiring to sync a full memslot; this ensures that KVM does not
take spinlocks for an extended period of time.  Second, in some cases a
large amount of time can pass between a call to KVM_GET_DIRTY_LOG and
userspace actually using the data in the page.  Pages can be modified
during this time, which is inefficint for both the guest and userspace:
the guest will incur a higher penalty due to write protection faults,
while userspace can see false reports of dirty pages.  Manual reprotection
helps reducing this time, improving guest performance and reducing the
number of dirty log false positives.