Speculative Execution Bugs
Remember the pipeline
- Fetch
- Decode
- Execute
- Write-back
Instructions appear to execute serially, but it is not this simple in practice.
What about branches?
- Wait for the branch to execute -> Inefficient
- Loop unrolling -> trade larger binary size for fewer branches
- Instruction scheduling -> compute the branch early
- Delay slots -> cross thread
- Branch prediction + speculative execution
Branch Prediction
Need to guess two things:
- Whether the branch will be taken
- Where the jump will go
Answer for the following:
- J (Jump)
- JR (Jump Register)
- BEQZ (Branch if equal zero)
Guessing wrong has a penalty, so guess well!
First attempt: Static Branch Prediction
- Build rules based on common behavior
- Example
    - Never assume jumps (Why?)
        - Execute everything sequentially
    - Always assume jumps (Why? 76.88% accurate)
        - Code is a lot of `for` loops
        - This ignores `if` statements though
    - Always take backwards branches (Why? 80.28% accurate; sketched below)
        - Loops are typically backwards branches
    - "Hints" -> Let the coder suggest (Why not?)
        - You don't know how someone is going to use the program
- Results can be better than random
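A minimal sketch of the backwards-branch rule ("backward taken, forward not taken"), assuming a fixed-width encoding where the branch target is known early; the function name is illustrative, not from the notes:

```c
#include <stdbool.h>
#include <stdint.h>

/* "Backward taken, forward not taken": a branch whose target is at a lower
 * address than the branch itself usually closes a loop, so predict taken;
 * forward branches (typical if/else skips) are predicted not taken. */
bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;
}
```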
Dynamic Branch Prediction
Look for patterns:
- Temporal: If you take a branch once, maybe you'll take it again?
- Spatial: Branches may have correlated behaviors.
Simple Branch History Table
- Consider a four-state diagram (a 2-bit saturating counter; sketched below)
    - Strongly taken
    - Weakly taken
    - Weakly not taken
    - Strongly not taken
- Not great, but better (80-90% accuracy)
- How much memory does this take?
- What is the complexity cost?
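A sketch of that four-state scheme as a table of 2-bit saturating counters indexed by low PC bits. The table size and names are made up for illustration, but they also make the memory question concrete: 1024 entries x 2 bits is only 256 bytes of state.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024

/* One 2-bit saturating counter per entry:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken */
static uint8_t bht[BHT_ENTRIES];

bool bht_predict(uint64_t pc) {
    return bht[pc % BHT_ENTRIES] >= 2;      /* predict taken in the two "taken" states */
}

void bht_update(uint64_t pc, bool taken) {
    uint8_t *c = &bht[pc % BHT_ENTRIES];
    if (taken && *c < 3) (*c)++;            /* saturate at the strong states */
    else if (!taken && *c > 0) (*c)--;      /* instead of wrapping around */
}
```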
Better Branch Predictor
Consider two levels:
- First level is the branch history (what has happened recently)
- Second level is a pattern history table
    - Given the branch history, what is the most likely outcome?
- Benefit: allows simple patterns to emerge
- Gets better accuracy
Better Better Branch Predictor
- Local -> keep a pattern for each conditional
- Global -> keep one pattern for everything
    - Takes advantage of global patterns and behaviors
    - Known as GShare (sketched below)
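A sketch of the two-level/gshare combination: a global history register records recent outcomes (first level) and is XORed with the branch address to index a pattern history table of the same 2-bit counters (second level). Sizes and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define PHT_ENTRIES  4096
#define HISTORY_BITS 12

static uint8_t  pht[PHT_ENTRIES];   /* second level: pattern history table */
static uint16_t history;            /* first level: last HISTORY_BITS branch outcomes */

static unsigned gshare_index(uint64_t pc) {
    /* XORing in the history lets the same branch use different counters for
     * different recent patterns -- this XOR is what "gshare" adds. */
    return (unsigned)((pc ^ history) % PHT_ENTRIES);
}

bool gshare_predict(uint64_t pc) {
    return pht[gshare_index(pc)] >= 2;
}

void gshare_update(uint64_t pc, bool taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
    history = (uint16_t)(((history << 1) | (taken ? 1 : 0)) & ((1u << HISTORY_BITS) - 1));
}
```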
Branch Target Predictor
- We can predict with 95% accuracy whether to jump
- But where to?
Goal: design predictors for where to jump, and update the PC in the pipeline accordingly.
Return Addresses
- Returns are a common special case (one function returns to many call sites)
- Predict them with a return address stack, MUXed together with the BTB (sketched below)
- Brings misprediction way down
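A sketch of that MUX, assuming fixed 4-byte instructions and a `btb_lookup()` stub standing in for a real branch target buffer: calls push their fall-through address onto a small return address stack, and returns prefer the stack over the BTB.

```c
#include <stdint.h>

#define RAS_DEPTH 16

static uint64_t ras[RAS_DEPTH];     /* return address stack */
static unsigned ras_top;

static uint64_t btb_lookup(uint64_t pc) {
    (void)pc;
    return 0;                       /* placeholder for a real BTB probe */
}

void on_call(uint64_t call_pc) {
    ras[ras_top++ % RAS_DEPTH] = call_pc + 4;   /* push the fall-through address */
}

uint64_t predict_target(uint64_t pc, int is_return) {
    if (is_return && ras_top > 0)
        return ras[--ras_top % RAS_DEPTH];      /* MUX: returns come from the stack */
    return btb_lookup(pc);                      /* everything else uses the BTB */
}
```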
Other improvements
- Use multiple predictors
- Neural Networks
- Etc...
Multi-Threading (Thread-Level Parallelism)
- Eliminate guessing if you multithread programs/processes
- Fine-grain: Interleave every instruction (not most efficient)
- Coarse-grain: Switch threads on slowdowns (e.g., L2 misses)
Multi-processor
- Options:
- 1 ILP program per core
- Multi-thread TLP per core
- Separate cache for L1/L2
- Partially shared cache for L3
- Shared cache for L4 (and above)
Important Notes
- Data is loaded in cache before a decision is made
- Based on most likely state
- Load first, ask permission later
- Data is not deleted from the cache; it will be replaced soon anyway, right?
Interacting with the Cache
- Determine whether something is there
    - Time a load of that address: if it comes back quickly, it was probably in the cache; if it takes a while, it was probably fetched from main memory (see the timing sketch below)
- Other interactions
    - Evict -> access other memory until the target has been pushed out of the cache
    - Flush -> remove a line explicitly with the `CLFLUSH` instruction
    - Prime -> place known addresses into the cache
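A minimal x86 sketch of that timing test (helper names are mine): time a single load with `rdtscp`, and use `clflush` to force a line into a known uncached state. The hit/miss threshold is machine-specific and has to be calibrated.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

/* Time one load of *addr in cycles: tens of cycles suggests a cache hit,
 * a few hundred suggests the line came from main memory. */
uint64_t time_load(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                     /* the probed access */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* The "Flush" primitive: evict the line containing addr from the cache hierarchy. */
void flush(void *addr) {
    _mm_clflush(addr);
    _mm_mfence();                    /* make sure the flush completes before continuing */
}
```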
Cache Attacks
- Basic idea
    - Set the cache in a known state
    - Check if it has changed
- For example
    - Evict + time
    - Prime + probe
    - Flush + reload (sketched below)
    - Prime + abort
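A sketch of one Flush + Reload round built on the hypothetical `flush`/`time_load` helpers above, assuming `probe` points into memory shared with the victim (e.g., a page of a shared library):

```c
#include <stdint.h>
#include <unistd.h>

uint64_t time_load(volatile uint8_t *addr);   /* from the timing sketch above */
void flush(void *addr);

#define HIT_THRESHOLD 120   /* cycles; calibrate per machine */

/* Returns 1 if the victim touched the monitored line during the waiting window. */
int flush_reload_round(volatile uint8_t *probe) {
    flush((void *)probe);                       /* 1. set the cache to a known state */
    usleep(500);                                /* 2. let the victim run */
    return time_load(probe) < HIT_THRESHOLD;    /* 3. a fast reload means it changed */
}
```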
RSA Example
- Algorithm: $C = M^e \bmod m$
- GnuPG: uses square-and-multiply exponentiation
    - If the bit is clear: square, reduce
    - If the bit is set: square, reduce, multiply, reduce
- How can we apply flush + reload?
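One way to see the leak is a textbook square-and-multiply loop (illustrative toy code limited to 64-bit operands, not GnuPG's implementation): the multiply step runs only when the current exponent bit is set, so flushing and reloading the cache lines holding the square/multiply/reduce code reveals the exponent bits one at a time.

```c
#include <stdint.h>

/* Toy left-to-right square-and-multiply for C = M^e mod m.
 * Real RSA uses bignums; the per-key-bit control flow is what matters here. */
uint64_t modexp(uint64_t M, uint64_t e, uint64_t m) {
    uint64_t c = 1;
    for (int i = 63; i >= 0; i--) {
        c = (c * c) % m;             /* square, reduce: every iteration */
        if ((e >> i) & 1)
            c = (c * M) % m;         /* multiply, reduce: only when the bit is set */
    }
    return c;
}
```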
Flush + Reload Other Attacks
The attack is targeted at the discrete Gaussian sampler, an important step in the Bimodal Lattice Signature Schemes (BLISS). After observing only 450 signatures with a perfect side-channel, an attacker is able to extract the secret BLISS-key in less than two minutes, with a success probability of 0.96
As a result, we can successfully circumvent kernel space ASLR on current operating systems.
We have fully working implementation of our attack which is able to recover AES keys after observing as little as 100 encryptions. It works against the OpenSSL 0.9.8n implementation of AES on Linux Systems.
Flush + Reload More Other Attacks
Last-level cache attack that can be launched over the web using JavaScript
We present attacks to monitor tap and swipe events as well as keystrokes, and even derive the lengths of words entered on the touchscreen
This generic attack technique allows us to profile and exploit cache-based information leakage of any program automatically, without prior knowledge of specific software versions or even specific system information.
Flush + Flush
- Replace reload with flush
- If you do a `CLFLUSH` on something that is not in the cache, it takes less time than a `CLFLUSH` on something that is in the cache
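A sketch of the resulting probe (same hedges as before: x86 intrinsics, machine-specific thresholds): time `clflush` itself instead of a reload, so the attacker never performs a cache-filling load of its own.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one clflush of *addr. Per the Flush + Flush observation, flushing a
 * line that is currently cached takes measurably longer than flushing one
 * that is not, so the timing of the flush is itself the signal. */
uint64_t time_flush(void *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    _mm_clflush(addr);
    uint64_t end = __rdtscp(&aux);
    return end - start;
}
```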
Transactional Memory
- No need to lock
- Threads try to dynamically execute in parallel
- If a conflict is detected -> abort and fix
- Keep track of read and write sets to find conflicts (hardware solution: Intel TSX)
- TSX behavior
    - A cache line that was written must remain in L1 (the write set)
    - A cache line that was read must remain in L3 (the read set)
- Prime + abort (similar to prime + probe; sketched below)
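A sketch of the TSX side using Intel's RTM intrinsics (requires a TSX-capable CPU and `-mrtm`; names are mine): lines touched inside a transaction join its read/write set, and evicting such a line aborts the transaction, which is the signal Prime + Abort uses in place of timing.

```c
#include <stdint.h>
#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */

/* Returns 1 if the transaction aborted. Inside the transaction the primed
 * line sits in the read set, so if anything displaces it from the cache,
 * the hardware aborts us -- no timing measurement needed. */
int primed_transaction(volatile uint8_t *line) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        (void)*line;                 /* prime: pull the line into the read set */
        /* ... a brief wait for victim activity would go here ... */
        _xend();
        return 0;                    /* committed: the line stayed resident */
    }
    return 1;                        /* aborted: the line was evicted (or another abort cause) */
}
```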
Return Oriented Programming
Basic Idea
- Buffer overflows
- But simple overwrites became hard (e.g., non-executable stacks)
- Chain together sets of `libc` return addresses, leveraging calling conventions
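For reference, the kind of bug this starts from, reduced to a deliberately unsafe toy: the unchecked copy lets a long input overwrite the saved return address, and with non-executable stacks the payload becomes a chain of addresses of existing code rather than injected shellcode.

```c
#include <stdio.h>
#include <string.h>

/* Deliberately unsafe: input longer than 64 bytes overruns buf and clobbers
 * the saved return address. A ROP payload fills that region with a sequence
 * of addresses of existing code (libc functions or small "gadgets" ending in
 * ret), so each return transfers control to the next piece of the chain. */
void vulnerable(const char *input) {
    char buf[64];
    strcpy(buf, input);              /* no bounds check */
    printf("copied: %s\n", buf);
}

int main(int argc, char **argv) {
    if (argc > 1)
        vulnerable(argv[1]);
    return 0;
}
```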