How Does Nested Virtualization Work?

What is Nested Virtualization?
-----------------------------------------------------------------------------------------------------
Software security is becoming an increasingly important criterion in the industry, and in recent years virtualization has become a popular technique for both protecting and attacking software. However, most virtualization frameworks (Blue Pill-like) do not give a guest the ability to virtualize one more layer. We call that ability "Nested Virtualization": the extra layer runs at level 2.


Basic Virtual Machine Monitor Architecture
------------------------------------------------------------------------------------------------------
Figure[1]

The host VMM traps any type of event it wants to monitor, such as interrupts, exceptions, and privileged register accesses. One of these events is the execution of a VMX instruction: once the VMM is loaded, it can always intercept any VMX instruction the guest executes, which gives us a good opportunity.

As the following chart shows:
Figure[2]

VMM Life Cycle
-----------------------------------------------------------------------------------------------------

VMExit: every time the guest OS is trapped, control flow is transferred to the VMM.
VMEntry: after each VMExit, the VMM transfers control flow back to the guest OS.


Figure[3]
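
The cycle above can be sketched as a toy model in Python. This is only a simulation of the control flow, not real VT-x code; the class name ToyVmm, the event strings, and the log format are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVmm:
    """Toy model of the VMExit / VMEntry cycle; no real VT-x is involved."""
    trapped: set                              # events this VMM chose to intercept
    log: list = field(default_factory=list)   # record of transitions, for inspection

    def handle(self, event):
        # A real handler would inspect the exit reason and emulate the event.
        self.log.append(("handled", event))

    def run_guest(self, events):
        for event in events:
            if event not in self.trapped:
                continue                          # the guest handles it, no transition
            self.log.append(("VMExit", event))    # guest -> VMM
            self.handle(event)
            self.log.append(("VMEntry", event))   # VMM -> guest


vmm = ToyVmm(trapped={"CPUID", "VMXON"})
vmm.run_guest(["ADD", "CPUID", "MOV"])
print(vmm.log)
```

Only the trapped CPUID produces a VMExit / VMEntry pair; the other instructions never leave the guest.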

So, How Do We Support Nested Virtualization?
------------------------------------------------------------------------------------------------------
- Handle VMX instruction events and emulate them correctly, such as VMXON, VMCLEAR, VMREAD, VMWRITE, etc.

- Normally, a VMM loader that wants to load itself into the OS and start trapping that OS follows a regular flow, such as the one below:

Figure[4]

As we can see, this is the regular initialization process of a VMM, and we are able to catch each of these steps using the capability the physical processor provides (VT-x).
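
A minimal sketch of how the host VMM might route these events, written as a Python model rather than real hypervisor code. The exit-reason numbers are the basic exit reasons for VMX instructions listed in the Intel SDM; the dispatch function and the handler table are hypothetical names used only for this example.

```python
from enum import IntEnum


class ExitReason(IntEnum):
    # Basic exit reasons for VMX instructions (Intel SDM, basic exit reason table).
    VMCLEAR = 19
    VMLAUNCH = 20
    VMPTRLD = 21
    VMPTRST = 22
    VMREAD = 23
    VMRESUME = 24
    VMWRITE = 25
    VMXOFF = 26
    VMXON = 27


def dispatch(raw_exit_reason, handlers):
    """Route a VMExit to its emulation handler.

    Returns None when the exit was not caused by a VMX instruction, so the
    caller can fall back to its ordinary (non-nested) handling.
    """
    try:
        reason = ExitReason(raw_exit_reason & 0xFFFF)  # low 16 bits: basic reason
    except ValueError:
        return None
    return handlers[reason]()


# Hypothetical handler table: each entry just reports what it would emulate.
handlers = {r: (lambda name=r.name: "emulate " + name) for r in ExitReason}
print(dispatch(27, handlers))
```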


First, we emulate the VMXON instruction. Normally we should perform basic checks on the guest environment, but we can actually obtain the result directly by executing a real VMXON instead of emulating every one of these checks. Moreover, we can initialize a structure that is used throughout the nested-virtualization process, for example for bookkeeping the current VCPU status (VMX root mode).
Figure[5]
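
As a rough illustration of the bookkeeping structure, the sketch below models the VMXON emulation in Python. NestedVcpu, VmxMode, and emulate_vmxon are hypothetical names, and the 4 KB alignment test stands in for the "basic checks" as one example; a real implementation would check considerably more.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VmxMode(Enum):
    ROOT = 1       # L1 believes it runs in VMX root mode
    NON_ROOT = 2   # L2 is running on behalf of L1


@dataclass
class NestedVcpu:
    """Per-processor bookkeeping created when L1 executes VMXON."""
    vmxon_region_pa: int
    mode: VmxMode = VmxMode.ROOT
    current_vmcs12_pa: Optional[int] = None   # filled in later by VMPTRLD


def emulate_vmxon(per_cpu, cpu_id, vmxon_region_pa):
    """Return True if the emulated VMXON succeeds, False if it must fail."""
    if vmxon_region_pa % 4096 != 0:
        return False              # the VMXON region must be 4 KB aligned
    if cpu_id in per_cpu:
        return False              # L1 is already in (emulated) VMX operation
    per_cpu[cpu_id] = NestedVcpu(vmxon_region_pa)
    return True


per_cpu = {}
print(emulate_vmxon(per_cpu, 0, 0x5000), per_cpu[0].mode)
```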

Second, when the VMM receives a VMCLEAR / VMPTRLD event from the guest, it means a VMCS has been prepared. We call it VMCS1-2, because level 1 (the nested VMM) prepared it for level 2 (the nested guest OS). We need to store its address, and we never load it on the physical processor.

The layout of VMCS1-2 can be redefined by us: the way the processor stores a VMCS internally is opaque to a VMM, so we can use the memory at that address freely.
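
A small sketch of the address bookkeeping, again as a Python model. NestedState and its method names are assumptions for illustration. The one architectural detail relied on is that VMCLEAR puts a VMCS into the "clear" launch state and, if that VMCS was the current one, leaves no current VMCS.

```python
class NestedState:
    """Tracks which VMCS1-2 the nested VMM (L1) currently has loaded."""

    def __init__(self):
        self.current_vmcs12_pa = None   # guest-physical address given to VMPTRLD
        self.launched = {}              # VMCS1-2 address -> has VMLAUNCH been done?

    def emulate_vmclear(self, pa):
        self.launched[pa] = False       # launch state becomes "clear"
        if self.current_vmcs12_pa == pa:
            self.current_vmcs12_pa = None

    def emulate_vmptrld(self, pa):
        # Only remember the address; L0 never loads this VMCS on the real CPU.
        self.launched.setdefault(pa, False)
        self.current_vmcs12_pa = pa


state = NestedState()
state.emulate_vmclear(0x8000)
state.emulate_vmptrld(0x8000)
print(hex(state.current_vmcs12_pa))
```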

After VMPTRLD, the guest (the nested VMM) starts filling in the VMCS, and we can capture this whole VMCS initialization process by trapping VMREAD / VMWRITE. Each trap tells us which VMCS field is being accessed and its value. Level 0 receives these events and decodes their parameters, so that it can keep its custom structure synchronized and fill in VMCS1-2.
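
The decoding step might look like the sketch below, under strong simplifying assumptions: both operands are taken from registers (a real handler decodes them from the VM-exit instruction information and may have to read guest memory), and VMCS1-2 is modeled as a plain dictionary. ShadowVmcs12 and the two emulate functions are hypothetical names; the two field encodings are the ones the Intel SDM assigns to guest RIP and host RIP.

```python
GUEST_RIP = 0x681E   # VMCS field encoding of the guest RIP
HOST_RIP = 0x6C16    # VMCS field encoding of the host RIP
MASK64 = (1 << 64) - 1


class ShadowVmcs12:
    """L0's private copy of what L1 wrote into its VMCS (VMCS1-2)."""

    def __init__(self):
        self.fields = {}   # field encoding -> value


def emulate_vmwrite(vmcs12, regs, field_reg, value_reg):
    # The field encoding and the value both come from guest registers here.
    vmcs12.fields[regs[field_reg]] = regs[value_reg] & MASK64


def emulate_vmread(vmcs12, regs, field_reg, dest_reg):
    # Fields that were never written read back as 0 in this toy model.
    regs[dest_reg] = vmcs12.fields.get(regs[field_reg], 0)


vmcs12 = ShadowVmcs12()
regs = {"rax": GUEST_RIP, "rcx": 0x401000, "rdx": 0}
emulate_vmwrite(vmcs12, regs, "rax", "rcx")   # L1: vmwrite rax, rcx
emulate_vmread(vmcs12, regs, "rax", "rdx")    # L1: vmread rdx, rax
print(hex(regs["rdx"]))
```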

Third, after the VMREAD / VMWRITE sequence is done, the nested VMM executes the VMLAUNCH instruction, and this is the most interesting part. At this point we create a new VMCS, which we call VMCS0-2, since the level 0 VMM creates it for the level 2 nested guest OS.

Under VMCS0-2, the nested VMM itself runs in the position of a guest OS.
It is interesting that a guest-state field of VMCS0-2 can take its value from a host-state field of VMCS1-2. What does that imply?
This time we use VMCS0-2 to execute a physical VMLAUNCH. The entry point of the next VMEntry is not the address following the nested VMM's VMLAUNCH instruction, but the new guest entry point: the guest RIP in VMCS1-2.


Figure[6]
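
To make the VMLAUNCH emulation concrete, here is a minimal Python sketch of how VMCS0-2 could be assembled. It assumes each VMCS is modeled as a dictionary whose keys start with "guest_" or "host_"; those names and the function build_vmcs02 are hypothetical. Control fields are left out entirely, although a real implementation has to combine them as well.

```python
def build_vmcs02(vmcs12, vmcs01):
    """Build the VMCS that L0 really launches in order to run L2."""
    vmcs02 = {}
    # Guest state: where L1 asked its guest to start running.
    for name, value in vmcs12.items():
        if name.startswith("guest_"):
            vmcs02[name] = value
    # Host state: any VMExit from L2 must land in L0, not in L1.
    for name, value in vmcs01.items():
        if name.startswith("host_"):
            vmcs02[name] = value
    return vmcs02


vmcs12 = {"guest_rip": 0x401000, "guest_rsp": 0x7000, "host_rip": 0x200000}
vmcs01 = {"guest_rip": 0x100, "host_rip": 0xFFFF0000, "host_rsp": 0xFFFF8000}
vmcs02 = build_vmcs02(vmcs12, vmcs01)
print({name: hex(value) for name, value in vmcs02.items()})
```

In this model the host RIP of VMCS1-2 is deliberately not copied: it is only needed later, when a VMExit has to be delivered to the nested VMM.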


After VMLAUNCH, guest OS code is executing again and the physical processor is back in non-root mode, which means any type of event that the level 0 or level 1 VMM wants will be trapped. As the following architecture shows, it is amusing that after VMLAUNCH the nested VMM is not the first to take over a trapped event, because the physical processor permits only one layer of VMExit at any given time.

Figure[7]


Since the guest OS is now running with VMCS0-2, the host VMM is able to identify which layer (nested VMM or guest OS) caused an event, and the host VMM should keep bookkeeping the VCPU status.

Then level 2 exits to level 0, and level 0 knows the trap came from level 2. It therefore reads and decodes VMCS1-2 to determine: "Does level 1 expect to handle this event?" If so, VMCS0-2 is magically changed to the status below:


Figure[8]

The magic part is that the guest-state fields (guest RIP, RSP, RFLAGS (interrupts cleared)) are filled with the host-state fields of VMCS1-2. What is the host-state RIP? It is the VMExit entry point of L1, the nested VMM. L0 thus tries to construct an ideal VMCS0-1: it saves the guest context and the VMExit information into VMCS1-2 and transfers control to the guest RIP, which is now the nested VMM's VMExit handler.

As in the following situation:
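
The decision and the field swap can be sketched as follows, as a Python model under stated assumptions: each VMCS is a dictionary, and L1's wishes are reduced to a hypothetical "intercepts" set of exit-reason names (in reality L0 has to decode the control fields and bitmaps in VMCS1-2). All function and key names are made up for illustration. The value 0x2 is what RFLAGS holds after a real VMExit: every flag cleared, including the interrupt flag, with only the always-set bit 1 remaining.

```python
RFLAGS_AFTER_VMEXIT = 0x2   # IF cleared; bit 1 of RFLAGS is always set
GUEST_STATE = ("guest_rip", "guest_rsp", "guest_rflags")


def l1_wants_exit(vmcs12, exit_reason):
    # Hypothetical simplification of decoding VMCS1-2's control fields.
    return exit_reason in vmcs12["intercepts"]


def reflect_exit_to_l1(vmcs02, vmcs12, exit_reason, qualification=0):
    # 1. Save L2's context and the exit information where L1 can VMREAD them.
    for name in GUEST_STATE:
        vmcs12[name] = vmcs02[name]
    vmcs12["exit_reason"] = exit_reason
    vmcs12["exit_qualification"] = qualification
    # 2. Make the next physical VMEntry land in L1's VMExit handler.
    vmcs02["guest_rip"] = vmcs12["host_rip"]
    vmcs02["guest_rsp"] = vmcs12["host_rsp"]
    vmcs02["guest_rflags"] = RFLAGS_AFTER_VMEXIT


def handle_l2_exit(vmcs02, vmcs12, exit_reason):
    """Return which level ends up handling the exit."""
    if l1_wants_exit(vmcs12, exit_reason):
        reflect_exit_to_l1(vmcs02, vmcs12, exit_reason)
        return "L1"
    return "L0"


vmcs12 = {"host_rip": 0x200000, "host_rsp": 0x9000, "intercepts": {"CPUID"}}
vmcs02 = {"guest_rip": 0x401000, "guest_rsp": 0x7000, "guest_rflags": 0x202}
print(handle_l2_exit(vmcs02, vmcs12, "CPUID"), hex(vmcs02["guest_rip"]))
```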

Figure[9]

After the VMExit emulation, the nested VMM executes VMRESUME to get back to its guest, so the host VMM prepares VMCS0-2 again (as in the VMLAUNCH emulation) and performs a VMEntry into the guest OS, which is controlled by VMCS1-2. Usually, execution resumes at the instruction following the one that caused the VMExit.
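
A short sketch of the VMRESUME emulation, using the same kind of dictionary model (the names are hypothetical). The example assumes L2 trapped on a CPUID instruction, which is 2 bytes long, and that the nested VMM's handler advanced the guest RIP in VMCS1-2 past it before executing VMRESUME.

```python
GUEST_STATE = ("guest_rip", "guest_rsp", "guest_rflags")


def emulate_vmresume(vmcs02, vmcs12):
    """Reload L2's state from VMCS1-2 into VMCS0-2 and return the resume RIP."""
    for name in GUEST_STATE:
        vmcs02[name] = vmcs12[name]
    return vmcs02["guest_rip"]


# State right after the exit was reflected: VMCS0-2 points into L1's handler,
# VMCS1-2 holds the saved L2 context.
vmcs12 = {"guest_rip": 0x401000, "guest_rsp": 0x7000, "guest_rflags": 0x202}
vmcs02 = {"guest_rip": 0x200000, "guest_rsp": 0x9000, "guest_rflags": 0x2}

vmcs12["guest_rip"] += 2   # L1's handler skips the 2-byte CPUID via VMWRITE
print(hex(emulate_vmresume(vmcs02, vmcs12)))
```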

Conclusion
-----------------------------------------------------------------------------------------------------
As a result, we are able to control the entire event control flow of a VMM.
The next topic will be Extended Page Table virtualization.





Source code Reference
-----------------------------------------------------------------------------------------------------
https://github.com/Kelvinhack/kHypervisor
https://github.com/tandasat/HyperPlatform

