## Inter-VM data exfiltration

The art of cache timing covert channel on x86 multi-core

Etienne Martineau Kernel Developer

August 2015






## **Disclaimer**

- This research was done on my own time; the opinions are mine, not my employer's.
- The information and the code provided in this presentation is to be used for educational purposes only.
- I am in no way responsible for any misuse of the information provided.
- In no way should you use the information to cause any kind of damage directly or indirectly.

# About me

$$\nabla \cdot \vec{E} = \frac{\rho}{\varepsilon_0} = 4\pi k\rho \qquad \oint \vec{E} \cdot d\vec{A} = \frac{q}{\varepsilon_0}$$
$$\nabla \cdot \vec{B} = 0 \qquad \oint \vec{B} \cdot d\vec{A} = 0$$
$$\nabla \times \vec{E} = -\frac{\partial \vec{B}}{\partial t} \qquad \oint \vec{E} \cdot d\vec{s} = -\frac{d\Phi_B}{dt}$$
$$\nabla \times \vec{B} = \frac{\vec{J}}{\varepsilon_0 c^2} + \frac{1}{c^2} \frac{\partial \vec{E}}{\partial t} \qquad \oint \vec{B} \cdot d\vec{s} = \mu_0 i + \frac{1}{c^2} \frac{\partial}{\partial t} \int \vec{E} \cdot d\vec{A}$$



| Hyper-threaded vCPU0 | Hyper-threaded vCPU1 | Hyper-threaded SUM | Non hyper-threaded vCPU0 | Non hyper-threaded vCPU1 | Non hyper-threaded SUM |
|--------|--------|--------|--------|--------|--------|
| 200000 | 200228 | 400228 | 100184 | 100184 | 200368 |
| 200088 | 200084 | 400172 | 100184 | 100184 | 200368 |
| 210768 | 193512 | 404280 | 100184 | 100184 | 200368 |
| 200096 | 200084 | 400180 | 100184 | 100184 | 200368 |
| 200072 | 200100 | 400172 | 100184 | 100188 | 200372 |
| 187312 | 226556 | 413868 | 100184 | 100184 | 200368 |
| 204776 | 205364 | 410140 | 100184 | 100184 | 200368 |
| 186996 | 231952 | 418948 | 100180 | 100188 | 200368 |
| 200016 | 200176 | 400192 | 100180 | 100188 | 200368 |
| 200088 | 200084 | 400172 | 100188 | 100184 | 200372 |
| 200084 | 200088 | 400172 | 100184 | 100184 | 200368 |
| 200076 | 200096 | 400172 | 100184 | 100184 | 200368 |
| 200084 | 200088 | 400172 | 100184 | 100184 | 200368 |
| 200240 | 191980 | 392220 | 100184 | 100184 | 200368 |
| 204588 | 205536 | 410124 | 100184 | 100188 | 200372 |
| 200000 | 200204 | 400204 | 100184 | 100188 | 200372 |




Cisco Confidential

*[Figure: Intel animation, Processor without Hyper-Threading Technology]*

An Intel processor with HT Technology can execute two software threads in an increasingly parallel manner, utilizing previously unused resources.

*[Figure: Intel® Processor with HT Technology (Thread 1, Thread 2)]*









## **Overview**

#### Goal

- Practical implementation (not just some research stuff)

#### How

- Abusing X86 shared resources
- Cache line encoding / decoding
- Getting around the HW pre-fetcher
- Data persistency and noise. What can be done?
- Guest to host page table de-obfuscation. The easy way
- High precision inter-VM synchronization → all about timers
- Detection / Mitigation

## **Shared resource: HT enabled**



## **Shared resource: HT disabled**



http://it.slashdot.org/story/05/05/17/201253/hyper-threading-linus-torvalds-vs-colin-perciv


## **Shared resource: Multi socket**




© 2009 Cisco Systems, Inc. All rights reserved.




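#### Cache line (64 bytes)

*[Figure: lat\_mem\_rd "out of the box": 2GB array, stride 128, single thread]*

VM#1 encodes a pattern in the cache lines:

|        | CL0  | CL1   | CL2   | CL3   | CL4  |
|--------|------|-------|-------|-------|------|
| Bit    | 1    | 0     | 0     | 0     | 1    |
| Action | Load | Flush | Flush | Flush | Load |

VM#2 decodes the cache line access time:

|             | CL0  | CL1  | CL2  | CL3  | CL4  |
|-------------|------|------|------|------|------|
| Access time | Fast | Slow | Slow | Slow | Fast |
| Bit         | 1    | 0    | 0    | 0    | 1    |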
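The load/flush encoding and the fast/slow decoding above can be sketched as a toy model. This is only a simulation in Python: a set stands in for the cache, the two latencies are rounded from the TSC readings shown later in the deck, and `ToyCache`, `THRESHOLD`, `encode` and `decode` are hypothetical names, not the presenter's code.

```python
# Toy model only: a Python set stands in for "which cache lines are resident".
CACHED_CYCLES = 64     # roughly the fast readings in the deck's test output
FLUSHED_CYCLES = 232   # roughly the slow readings in the deck's test output
THRESHOLD = 150        # assumed cut-off between "fast" and "slow"


class ToyCache:
    def __init__(self):
        self.resident = set()

    def load(self, line):
        self.resident.add(line)

    def flush(self, line):
        self.resident.discard(line)

    def timed_access(self, line):
        cycles = CACHED_CYCLES if line in self.resident else FLUSHED_CYCLES
        self.resident.add(line)  # reading a line brings it into the cache
        return cycles


def encode(cache, bits):
    """Sender side: load the line for a 1, flush it for a 0."""
    for line, bit in enumerate(bits):
        if bit:
            cache.load(line)
        else:
            cache.flush(line)


def decode(cache, n_lines):
    """Receiver side: a fast access reads as 1, a slow access as 0."""
    return [1 if cache.timed_access(line) < THRESHOLD else 0
            for line in range(n_lines)]


cache = ToyCache()
pattern = [1, 0, 0, 0, 1]
encode(cache, pattern)
first_read = decode(cache, len(pattern))
second_read = decode(cache, len(pattern))
```

In this model `first_read` recovers the pattern, while `second_read` is all ones because the first read pulled every line into the cache: decoding is destructive, so the sender has to re-encode before every read.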


*[Diagram: the Client signals the Server. The Client encodes into cache lines CL0, CL1, CL2 ... CLn on the host and the Server decodes them. Timeline: Client Encode → Signal Server → Server Decode.]*

#### NO VM

- Simple Client / Server test program
- Cache Line from shared memory directly
- Mutex for inter-process signaling
- Client encode a pattern
- Server decode
- → Something weird?



- Simple test:
  - Flush CL0 -> CL100
  - Measure CL access time for CL0 -> CL100
  - → Long latency expected for all CL

Zap Cache Line 0->100: DONE

```
Load Cache Line 0->100 ( TSC cycle ):
240 264 232 232 236 232 232 232 232 232
236 228  68  68  64  68 232 260 232 232
232 232 232 232 232 232 236 232  64  64
 64  64  64  64  64  68  68  64  64  68
 64  68  64  64  68  64  64  68  68  64
 64  68  64  64  68  64  68  68  64  64
 64  64  68  64 232 236 228 232 228 236
232 236 232 232 228 236  68  64  64  64
 64  64  68  64  64  68  68  64  64  68
 64  68  64  64  64  64  68  68  64  64
```

• ???

Prefetching in general means bringing data or instructions from memory into the cache **before they are needed** 

The Core™ i7 processor and Xeon® 5500 series processors, for example, have some prefetchers that bring data into the L1 cache and some that bring data into the L2.

There are also different algorithms – some monitor data access patterns for a particular cache and then **try to predict what addresses will be needed in the future.** 
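To make the "predict what addresses will be needed" idea concrete, here is a minimal toy stride prefetcher in Python. It is an assumption-laden sketch, not a description of any real Intel prefetcher: `ToyStridePrefetcher` and `count_misses` are hypothetical names, and the only rule modelled is "if the same stride is seen twice in a row, fetch the next line ahead of time".

```python
class ToyStridePrefetcher:
    """Toy cache plus a one-entry stride detector (illustration only)."""

    def __init__(self):
        self.cached = set()
        self.last_line = None
        self.last_stride = None

    def access(self, line):
        hit = line in self.cached
        self.cached.add(line)
        if self.last_line is not None:
            stride = line - self.last_line
            if stride == self.last_stride:
                # Same stride twice in a row: fetch the predicted next line.
                self.cached.add(line + stride)
            self.last_stride = stride
        self.last_line = line
        return hit


def count_misses(order):
    """Start from an empty (flushed) cache and count slow accesses."""
    cache = ToyStridePrefetcher()
    return sum(0 if cache.access(line) else 1 for line in order)


# Sequential scan CL0 -> CL100 after a flush.
sequential_misses = count_misses(range(101))

# An irregular order where no stride repeats.
irregular_misses = count_misses([0, 50, 3, 77, 21, 64, 9])
```

In this model only the first three lines of the sequential scan are slow; everything after that is fetched ahead of the reader, which is the kind of effect that would hide a flushed line from the decoder. The irregular order misses on every access.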

*[Figure: BIOS setup screen (Main / Devices / Startup menus) shown over the test output]*

- Simple trick: randomize the CL access order
  - CL access random within a page
  - CL access random across pages
- This apparently manages to confuse the HW prefetcher!
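One way to build such a randomized access order is sketched below. It assumes 4 KiB pages and the 64-byte cache line from the deck, and it assumes both sides derive the same order from a shared seed; the deck does not say how the order is chosen, so `randomized_offsets` and the seed value are hypothetical.

```python
import random

LINE_SIZE = 64                             # bytes per cache line (from the deck)
PAGE_SIZE = 4096                           # assumed 4 KiB pages
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE


def randomized_offsets(n_pages, seed):
    """Return byte offsets of every cache line in a buffer, in shuffled order."""
    rng = random.Random(seed)              # shared seed: an assumption
    pages = list(range(n_pages))
    rng.shuffle(pages)                     # random across pages
    offsets = []
    for page in pages:
        lines = list(range(LINES_PER_PAGE))
        rng.shuffle(lines)                 # random within a page
        offsets.extend(page * PAGE_SIZE + line * LINE_SIZE for line in lines)
    return offsets


order = randomized_offsets(n_pages=4, seed=2015)
```

Every cache line is still visited exactly once; only the visiting order changes, so the payload capacity of the buffer is unaffected.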



What happens if we wait longer before decoding?

*[Figures: decoded cache line pattern after successively longer waits]*

- Encoded data in the cache evaporates pretty quickly.




















- Client in VM#1, Server in VM#2
- L2 and L3 caches are tagged by physical address, but inside a VM the physical address you see has nothing to do with the real physical address on bare metal that the cache is using.






## **Page de-duplication**

KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, **KSM merges them into a single physical page on the bare-metal host kernel.** 

If one of the programs wants to modify a shared page, KSM kicks in and un-merges it (the merged page is copy-on-write).

This is useful for virtualization with KVM. Once the guests are running, the contents of the guest operating system image can be shared when the guests run the same operating system or applications.
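The merge and un-merge behaviour can be illustrated with a small toy model. This is not how the kernel implements KSM; `ToyHost`, `ksm_scan` and the VM names are hypothetical, and the point is only to show why two guests end up backed by the same host frame, and therefore the same cache lines, once their page contents match.

```python
class ToyHost:
    """Toy host memory: frames hold page contents, guests map pages to frames."""

    def __init__(self):
        self.frames = {}       # frame number -> page contents (bytes)
        self.mapping = {}      # (vm, guest_page) -> frame number
        self.next_frame = 0

    def _new_frame(self, data):
        frame = self.next_frame
        self.next_frame += 1
        self.frames[frame] = data
        return frame

    def map_page(self, vm, guest_page, data):
        self.mapping[(vm, guest_page)] = self._new_frame(data)

    def ksm_scan(self):
        """Merge pages with identical contents onto a single frame."""
        seen = {}
        for key, frame in sorted(self.mapping.items()):
            data = self.frames[frame]
            if data in seen and seen[data] != frame:
                self.mapping[key] = seen[data]
                if frame not in self.mapping.values():
                    del self.frames[frame]   # the duplicate frame is freed
            else:
                seen.setdefault(data, frame)

    def write(self, vm, guest_page, data):
        """Writing to a shared frame un-merges it (copy-on-write)."""
        key = (vm, guest_page)
        frame = self.mapping[key]
        shared = sum(1 for f in self.mapping.values() if f == frame) > 1
        if shared:
            self.mapping[key] = self._new_frame(data)
        else:
            self.frames[frame] = data


host = ToyHost()
host.map_page("vm1", 0, b"A" * 4096)
host.map_page("vm2", 7, b"A" * 4096)
shared_before = host.mapping[("vm1", 0)] == host.mapping[("vm2", 7)]
host.ksm_scan()
shared_after_scan = host.mapping[("vm1", 0)] == host.mapping[("vm2", 7)]
host.write("vm2", 7, b"B" * 4096)
shared_after_write = host.mapping[("vm1", 0)] == host.mapping[("vm2", 7)]
```

Note that the two guests use different guest page numbers (0 and 7) yet share one host frame after the scan, which is what lets both sides reach the same physical cache lines without knowing the host addresses.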





#### Page table deobfuscation










- There is no synchronization primitive across processes running in different VMs?
- In reality there are mechanisms to do that (e.g. ivshmem), but they are not enabled in production environments
- We need something to replace the mutex



- Forget about the synchronization aspect and hope for the best
- With error correction we can achieve some data transmission.
- Very low bit rates
- CPU consumption is low
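The deck does not say which error correction scheme was used. As an illustration only, the simplest option is a repetition code with majority vote, sketched here with hypothetical helper names; it also shows why the bit rate drops, since every payload bit costs `n` channel bits.

```python
def encode_repetition(bits, n=3):
    """Send every bit n times."""
    return [b for b in bits for _ in range(n)]


def decode_repetition(received, n=3):
    """Majority vote over each group of n received bits."""
    out = []
    for i in range(0, len(received), n):
        chunk = received[i:i + n]
        out.append(1 if sum(chunk) * 2 > len(chunk) else 0)
    return out


message = [1, 0, 0, 0, 1]
sent = encode_repetition(message)

# Simulate channel noise: flip one bit in two different groups.
noisy = list(sent)
noisy[1] ^= 1
noisy[9] ^= 1

recovered = decode_repetition(noisy)
```

With `n=3` the code survives one flipped bit per group; two flips in the same group would still corrupt that payload bit.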



- Busy loop on each side
- Client faster than Server
- At some point there will be an overlap and the server will pick up the signal
- CPU consumption is **High**
- OK bit rates
- We want <1% CPU usage to remain undetected.




- Define a common period 'T'
- Client-Server lock into phase
  - Server sends a sync pattern
  - Client sweeps over the period in search of the sync




- Once the sync is found the phase is adjusted and we are ready for transmission.
- For that to work we need a monotonic pulse
- Some jitter but not too much (lots of noise in VMs → data evaporates out of the cache very quickly)
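The sweep-then-lock idea can be sketched as a small simulation. All numbers and names here are hypothetical (time units are arbitrary, and `sweep_for_sync` is not the presenter's code); the sketch assumes the sync pattern is only visible for a short window inside each period.

```python
def sweep_for_sync(period, sync_phase, window, step):
    """Try sampling phases 0, step, 2*step, ... within one period.

    The sync pattern is assumed visible only while the sample falls in
    [sync_phase, sync_phase + window) modulo the period. Returns the
    first sampling phase that sees it, or None if the sweep misses.
    """
    phase = 0
    while phase < period:
        if (phase - sync_phase) % period < window:
            return phase
        phase += step
    return None


# Sweep step no larger than the sync window: the sweep cannot miss.
locked_phase = sweep_for_sync(period=1000, sync_phase=637, window=20, step=10)

# Sweep step larger than the window: the sweep can step over the sync.
missed = sweep_for_sync(period=1000, sync_phase=631, window=5, step=10)
```

The second call shows the constraint this approach puts on the timers: the sweep step, and therefore the jitter of each sample, has to stay below the width of the window in which the sync is still readable from the cache.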


- How to achieve a monotonic pulse?
  - Timers
- Why timers?
  - We need to sleep → avoid detection (< 1% CPU usage)




- Jitter comes from both VMs
- Too much jitter




- The idea here is to pad up to some value above the maximum jitter
- The problem here is that the padding itself is subject to noise
- In other words, the more time you spend trying to immunize yourself against noise, the more noise you end up accumulating
- Padding consumes CPU
- By stretching the timer period it's easy to stay under 1% of CPU usage
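A minimal sketch of the padding idea follows: sleep on a timer, then busy-wait up to a fixed deadline placed beyond the worst-case wake-up jitter. It is a simplification under stated assumptions: the spin reads the clock on every iteration, `time.perf_counter_ns` stands in for the TSC, and `compensated_wait` and `FakeClock` are hypothetical names. The fake clock exists only so the behaviour can be checked deterministically.

```python
import time


def compensated_wait(target_ns, pad_ns, clock=time.perf_counter_ns,
                     sleep_ns=lambda ns: time.sleep(ns / 1e9)):
    """Sleep until about target_ns, then spin until target_ns + pad_ns."""
    remaining = target_ns - clock()
    if remaining > 0:
        sleep_ns(remaining)           # cheap, but wakes late by unknown jitter
    deadline = target_ns + pad_ns     # pad_ns must exceed the worst-case jitter
    now = clock()
    while now < deadline:             # padding: busy loop checked against clock
        now = clock()
    return now - deadline             # residual error after compensation


class FakeClock:
    """Deterministic stand-in for the TSC and the timer (illustration only)."""

    def __init__(self, read_cost_ns, wake_late_ns):
        self.now = 0
        self.read_cost_ns = read_cost_ns
        self.wake_late_ns = wake_late_ns

    def read(self):
        self.now += self.read_cost_ns
        return self.now

    def sleep(self, ns):
        self.now += ns + self.wake_late_ns


def residual(wake_late_ns):
    fake = FakeClock(read_cost_ns=20, wake_late_ns=wake_late_ns)
    return compensated_wait(100_000, 50_000,
                            clock=fake.read, sleep_ns=fake.sleep)


# Three very different wake-up delays, all below the 50 us pad.
residuals = [residual(late) for late in (0, 7_000, 33_333)]
```

Whatever the wake-up delay, the residual stays below the cost of one clock read (20 ns in the fake clock), as long as the delay is smaller than the pad. The price is the CPU time spent spinning, which is why the period has to be stretched to stay under the CPU budget.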


- It's a tricky problem but in the end I got it right!
- In short, the padding uses a calibrated software loop that is kept in check with the TSC
- Assume a 2.4 GHz machine:
  - On an idle system: ~50 cycles → ~20 ns
  - On a loaded system: ~300 cycles → ~120 ns

#### Timers:

- 100 µs = 240 000 cycles
- 10 µs = 24 000 cycles (best case)
## Recap

- Encoding / decoding based on memory access time
  - (1 = slow, 0 = fast)
- Got rid of the HW prefetching (without disabling it from BIOS!)
  - (randomized the access to cache lines / pages)
- Physical memory pages that are shared across VMs
  - Thanks to KSM ☺
- PLL and high precision inter-VM synchronization
  - (Compensated timer, < 120 ns jitter)
- Time for a demo!









## **Mitigation**

- Disable page-deduplication (KSM) / Per-VM policy
  - No inter-VM shared read-only pages
  - Flush 'clflush' and reload won't work
  - No OS / Application fingerprinting (de-duplication page-fault)
  - Higher memory cost
- X86 'clflush' instruction: Privilege?
  - Microcode?
- Co-location policy (per-core / per-socket / per-box)

## **Detection**

- Hardware counter
- Inter-VM scheduling "abnormality"
- TSC related "abnormality"

# Thank you!