Speculative Execution Bugs
Remember the pipeline
- Fetch
- Decode
- Execute
- Write-back
Instructions appear to execute serially, but it is not this simple in practice.
What about branches?
- Wait for the branch to execute -> Inefficient
- Loop unrolling -> trade larger binary size for fewer branches
- Instruction scheduling -> compute the branch early
- Delay slots -> cross thread
- Branch prediction + speculative execution
Branch Prediction
Need to guess two things:
- Whether the branch will be taken
- Where the jump will go
Answer for the following:
- J (Jump)
- JR (Jump Register)
- BEQZ (Branch if equal zero)
Guessing wrong has a penalty, so guess well!
First attempt: Static Branch Prediction
- Build rules based on common behavior
- Example
    - Never assume jumps (Why?)
        - Execute everything sequentially
    - Always assume jumps (Why? 76.88% accurate)
        - Code is a lot of `for` loops
        - This ignores `if` statements though
    - Always take backwards branches (Why? 80.28% accurate; sketched below)
        - Loops are typically backwards branches
    - "Hints" -> Let the coder suggest (Why not?)
        - You don't know how someone is going to use the program
- Results can be better than random
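A minimal sketch of the backwards-branch rule ("backward taken, forward not taken"), assuming a fixed-width encoding where the branch target is known early; the function name is illustrative, not from the notes:

```c
#include <stdbool.h>
#include <stdint.h>

/* "Backward taken, forward not taken": a branch whose target is at a lower
 * address than the branch itself usually closes a loop, so predict taken;
 * forward branches (typical if/else skips) are predicted not taken. */
bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;
}
```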
Dynamic Branch Prediction
Look for patterns:
- Temporal: If you take a branch once, maybe you'll take it again?
- Spatial: Branches may have correlated behaviors.
Simple Branch History Table
- Consider a four-state diagram (a 2-bit saturating counter; sketched below)
    - Strongly taken
    - Weakly taken
    - Weakly not taken
    - Strongly not taken
- Not great, but better (80-90% accuracy)
- How much memory does this take?
- What is the complexity cost?
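A sketch of that four-state scheme as a table of 2-bit saturating counters indexed by low PC bits. The table size and names are made up for illustration, but they also make the memory question concrete: 1024 entries x 2 bits is only 256 bytes of state.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024

/* One 2-bit saturating counter per entry:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken */
static uint8_t bht[BHT_ENTRIES];

bool bht_predict(uint64_t pc) {
    return bht[pc % BHT_ENTRIES] >= 2;      /* predict taken in the two "taken" states */
}

void bht_update(uint64_t pc, bool taken) {
    uint8_t *c = &bht[pc % BHT_ENTRIES];
    if (taken && *c < 3) (*c)++;            /* saturate at the strong states */
    else if (!taken && *c > 0) (*c)--;      /* instead of wrapping around */
}
```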
Better Branch Predictor
Consider two levels:
- First level is the branch history (what has happened recently)
- Second level is a pattern history table
    - Given the branch history, what is the most likely outcome?
- Benefit: allows simple patterns to emerge
- Gets better accuracy
Better Better Branch Predictor
- Local -> keep a pattern for each conditional
- Global -> keep one pattern for everything
    - Takes advantage of global patterns and behaviors
    - Known as GShare (sketched below)
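A sketch of the two-level/gshare combination: a global history register records recent outcomes (first level) and is XORed with the branch address to index a pattern history table of the same 2-bit counters (second level). Sizes and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define PHT_ENTRIES  4096
#define HISTORY_BITS 12

static uint8_t  pht[PHT_ENTRIES];   /* second level: pattern history table */
static uint16_t history;            /* first level: last HISTORY_BITS branch outcomes */

static unsigned gshare_index(uint64_t pc) {
    /* XORing in the history lets the same branch use different counters for
     * different recent patterns -- this XOR is what "gshare" adds. */
    return (unsigned)((pc ^ history) % PHT_ENTRIES);
}

bool gshare_predict(uint64_t pc) {
    return pht[gshare_index(pc)] >= 2;
}

void gshare_update(uint64_t pc, bool taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
    history = (uint16_t)(((history << 1) | (taken ? 1 : 0)) & ((1u << HISTORY_BITS) - 1));
}
```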
Branch Target Predictor
- We can predict with 95% accuracy whether to jump
- But where to?
Goal: design predictors for where to jump, and update the PC in the pipeline accordingly.
Return Addresses
- Returns are a common special case (one function returns to many call sites)
- Predict them with a return address stack, MUXed together with the BTB (sketched below)
- Brings misprediction way down
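A sketch of that MUX, assuming fixed 4-byte instructions and a `btb_lookup()` stub standing in for a real branch target buffer: calls push their fall-through address onto a small return address stack, and returns prefer the stack over the BTB.

```c
#include <stdint.h>

#define RAS_DEPTH 16

static uint64_t ras[RAS_DEPTH];     /* return address stack */
static unsigned ras_top;

static uint64_t btb_lookup(uint64_t pc) {
    (void)pc;
    return 0;                       /* placeholder for a real BTB probe */
}

void on_call(uint64_t call_pc) {
    ras[ras_top++ % RAS_DEPTH] = call_pc + 4;   /* push the fall-through address */
}

uint64_t predict_target(uint64_t pc, int is_return) {
    if (is_return && ras_top > 0)
        return ras[--ras_top % RAS_DEPTH];      /* MUX: returns come from the stack */
    return btb_lookup(pc);                      /* everything else uses the BTB */
}
```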
Other improvements
- Use multiple predictors
- Neural Networks
- Etc...
Multi-Threading (Thread-Level Parallelism)
- Eliminate guessing if you multithread programs/processes
- Fine-grain: Interleave every instruction (not most efficient)
- Coarse-grain: Switch threads on slowdowns (e.g., L2 misses)
Multi-processor
- Options:
- 1 ILP program per core
- Multi-thread TLP per core
- Separate cache for L1/L2
- Partially shared cache for L3
- Shared cache for L4 (and above)
Important Notes
- Data is loaded in cache before a decision is made
- Based on most likely state
- Load first, ask permission later
- Data is not deleted from the cache; it will be replaced soon anyway, right?
Interacting with the Cache
- Determine whether something is there
    - Time a load of that address: if it comes back quickly, it was probably in the cache; if it takes a while, it was probably fetched from main memory (see the timing sketch below)
- Other interactions
    - Evict -> access other memory until the target has been pushed out of the cache
    - Flush -> remove a line explicitly with the `CLFLUSH` instruction
    - Prime -> place known addresses into the cache
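A minimal x86 sketch of that timing test (helper names are mine): time a single load with `rdtscp`, and use `clflush` to force a line into a known uncached state. The hit/miss threshold is machine-specific and has to be calibrated.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

/* Time one load of *addr in cycles: tens of cycles suggests a cache hit,
 * a few hundred suggests the line came from main memory. */
uint64_t time_load(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                     /* the probed access */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* The "Flush" primitive: evict the line containing addr from the cache hierarchy. */
void flush(void *addr) {
    _mm_clflush(addr);
    _mm_mfence();                    /* make sure the flush completes before continuing */
}
```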
Cache Attacks
- Basic idea
    - Set the cache in a known state
    - Check if it has changed
- For example
    - Evict + time
    - Prime + probe
    - Flush + reload (sketched below)
    - Prime + abort
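A sketch of one Flush + Reload round built on the hypothetical `flush`/`time_load` helpers above, assuming `probe` points into memory shared with the victim (e.g., a page of a shared library):

```c
#include <stdint.h>
#include <unistd.h>

uint64_t time_load(volatile uint8_t *addr);   /* from the timing sketch above */
void flush(void *addr);

#define HIT_THRESHOLD 120   /* cycles; calibrate per machine */

/* Returns 1 if the victim touched the monitored line during the waiting window. */
int flush_reload_round(volatile uint8_t *probe) {
    flush((void *)probe);                       /* 1. set the cache to a known state */
    usleep(500);                                /* 2. let the victim run */
    return time_load(probe) < HIT_THRESHOLD;    /* 3. a fast reload means it changed */
}
```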
RSA Example
- Algorithm: $C = M^e \bmod m$
- GnuPG: uses square-and-multiply exponentiation
    - If the bit is clear: square, reduce
    - If the bit is set: square, reduce, multiply, reduce
- How can we apply flush + reload?
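One way to see the leak is a textbook square-and-multiply loop (illustrative toy code limited to 64-bit operands, not GnuPG's implementation): the multiply step runs only when the current exponent bit is set, so flushing and reloading the cache lines holding the square/multiply/reduce code reveals the exponent bits one at a time.

```c
#include <stdint.h>

/* Toy left-to-right square-and-multiply for C = M^e mod m.
 * Real RSA uses bignums; the per-key-bit control flow is what matters here. */
uint64_t modexp(uint64_t M, uint64_t e, uint64_t m) {
    uint64_t c = 1;
    for (int i = 63; i >= 0; i--) {
        c = (c * c) % m;             /* square, reduce: every iteration */
        if ((e >> i) & 1)
            c = (c * M) % m;         /* multiply, reduce: only when the bit is set */
    }
    return c;
}
```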
Flush + Reload Other Attacks
The attack is targeted at the discrete Gaussian sampler, an important step in the Bimodal Lattice Signature Schemes (BLISS). After observing only 450 signatures with a perfect side-channel, an attacker is able to extract the secret BLISS-key in less than two minutes, with a success probability of 0.96
As a result, we can successfully circumvent kernel space ASLR on current operating systems.
We have fully working implementation of our attack which is able to recover AES keys after observing as little as 100 encryptions. It works against the OpenSSL 0.9.8n implementation of AES on Linux Systems.
Flush + Reload More Other Attacks
Last-level cache attack that can be launched over the web using JavaScript
We present attacks to monitor tap and swipe events as well as keystrokes, and even derive the lengths of words entered on the touchscreen
This generic attack technique allows us to profile and exploit cache-based information leakage of any program automatically, without prior knowledge of specific software versions or even specific system information.
Flush + Flush
- Replace reload with flush
- If you do a `CLFLUSH` on something that is not in the cache, it takes less time than a `CLFLUSH` on something that is in the cache
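A sketch of the resulting probe (same hedges as before: x86 intrinsics, machine-specific thresholds): time `clflush` itself instead of a reload, so the attacker never performs a cache-filling load of its own.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one clflush of *addr. Per the Flush + Flush observation, flushing a
 * line that is currently cached takes measurably longer than flushing one
 * that is not, so the timing of the flush is itself the signal. */
uint64_t time_flush(void *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    _mm_clflush(addr);
    uint64_t end = __rdtscp(&aux);
    return end - start;
}
```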
Transactional Memory
- No need to lock
- Threads try to dynamically execute in parallel
- If a conflict is detected -> abort and fix
- Keep track of read and write sets to find conflicts (hardware solution: Intel TSX)
- TSX behavior
    - A cache line that was written must remain in L1 (the write set)
    - A cache line that was read must remain in L3 (the read set)
- Prime + abort (similar to prime + probe; sketched below)
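A sketch of the TSX side using Intel's RTM intrinsics (requires a TSX-capable CPU and `-mrtm`; names are mine): lines touched inside a transaction join its read/write set, and evicting such a line aborts the transaction, which is the signal Prime + Abort uses in place of timing.

```c
#include <stdint.h>
#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */

/* Returns 1 if the transaction aborted. Inside the transaction the primed
 * line sits in the read set, so if anything displaces it from the cache,
 * the hardware aborts us -- no timing measurement needed. */
int primed_transaction(volatile uint8_t *line) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        (void)*line;                 /* prime: pull the line into the read set */
        /* ... a brief wait for victim activity would go here ... */
        _xend();
        return 0;                    /* committed: the line stayed resident */
    }
    return 1;                        /* aborted: the line was evicted (or another abort cause) */
}
```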
Return Oriented Programming
Basic Idea
- Buffer overflows
- But simple overwrites became hard (e.g., non-executable stacks)
- Chain together sets of `libc` return addresses, leveraging calling conventions
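For reference, the kind of bug this starts from, reduced to a deliberately unsafe toy: the unchecked copy lets a long input overwrite the saved return address, and with non-executable stacks the payload becomes a chain of addresses of existing code rather than injected shellcode.

```c
#include <stdio.h>
#include <string.h>

/* Deliberately unsafe: input longer than 64 bytes overruns buf and clobbers
 * the saved return address. A ROP payload fills that region with a sequence
 * of addresses of existing code (libc functions or small "gadgets" ending in
 * ret), so each return transfers control to the next piece of the chain. */
void vulnerable(const char *input) {
    char buf[64];
    strcpy(buf, input);              /* no bounds check */
    printf("copied: %s\n", buf);
}

int main(int argc, char **argv) {
    if (argc > 1)
        vulnerable(argv[1]);
    return 0;
}
```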