What should every (Dotty) developer know about hardware
Dmitry Petrashko

Dotty?
- a new compiler for a Scala-like language
- developed at LAMP EPFL
- is currently not ready for production use
- some of the new technologies explored in this project will find their way into future versions of Scala
About me:
- https://github.com/darkdimius/
- doing PhD at EPFL
- previously worked on ScalaBlitz
- was the first to join Martin in working on Dotty, in 2014
- together with Martin, built the foundations of Dotty
- started the Dotty Linker & optimizer project
Agenda
- Basic tools: sampling and tracing profilers
  - How do they work?
  - Limitations
- How do CPUs work:
  - pipelining
  - out-of-order execution
  - on-chip caching
- Practical experience: Dotty Miniphases
Sampling profilers: Idea
Collect stacks from all threads every X milliseconds.
Common values for X are 20 - 100 ms.
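A minimal sketch of the idea, not how any particular profiler is implemented: a loop that snapshots all thread stacks every X milliseconds and counts which method is on top. The class name `ToySampler` and its `sample` method are hypothetical; `Thread.getAllStackTraces()` is the standard JDK call, and on HotSpot it collects stacks at a safepoint.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy sampler for illustration only, not a real profiler.
final class ToySampler {
    // Takes `samples` snapshots, `intervalMillis` apart, and returns
    // "Class.method" -> number of times it was the top frame of some thread.
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // One snapshot of the stacks of all live threads.
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) continue; // no Java frames available
                StackTraceElement top = stack[0];
                hits.merge(top.getClassName() + "." + top.getMethodName(), 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop sampling
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // X = 20 ms, the low end of the common range.
        sample(10, 20).forEach((method, count) -> System.out.println(count + "  " + method));
    }
}
```

The sampler thread itself shows up in the results, since it is always caught inside the stack-dumping call.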
Sampling profilers: Pros
- no code modification
- can easily connect to already running VM
- very low overhead
Sampling profilers: Cons
- need to collect stacks, 1 ms on average
- stacks are only collectable at safe points
- introduces safe-point bias
- some methods will never show up
- precise only for blocking calls and interpreted code
Sampling profilers: Awesome for
- having a look at a running production system
- finding deadlocks
- initial investigation
Tracing profilers: Idea
Idea: modify bytecode, making every method remember when it started and finished execution.
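A rough sketch of what an instrumented method ends up doing, written by hand in source form. A real tracing profiler rewrites bytecode instead; `ToyTracer`, `enter` and `exit` are hypothetical names, and the bookkeeping is deliberately simple and not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-written equivalent of what bytecode instrumentation adds.
final class ToyTracer {
    // method name -> { call count, total nanoseconds (inclusive of callees) }
    static final Map<String, long[]> stats = new HashMap<>();

    static long enter() {
        return System.nanoTime();
    }

    static void exit(String method, long startNanos) {
        long[] s = stats.computeIfAbsent(method, k -> new long[2]);
        s[0] += 1;
        s[1] += System.nanoTime() - startNanos;
    }

    // Original method body: return n < 2 ? n : fib(n - 1) + fib(n - 2);
    // The instrumented version wraps it with enter/exit bookkeeping.
    static int fib(int n) {
        long start = enter();
        try {
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        } finally {
            exit("fib", start);
        }
    }

    public static void main(String[] args) {
        int result = fib(10);
        long[] s = stats.get("fib");
        System.out.println("fib(10) = " + result + ", calls = " + s[0] + ", total ns = " + s[1]);
    }
}
```

Every call now pays for two clock reads and a map update, and the extra code can change inlining decisions, which is where the overhead and the changed JIT behaviour come from.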
Tracing profilers: Pros
- no source code modifications
- can see all the methods
- can also sample allocations
Tracing profilers: Cons
- VERY big overhead. 20x slowdowns are not rare
- you are analyzing a different application
- different JIT decisions
- different memory behavior, new allocations
- different CPU behaviour
- precise for this new app, but not clear what it tells about yours
Tracing profilers: Awesome for
- finding parts of code that are wrongly assumed to rarely execute
- tracing memory allocations
What if there are no hotspots?
Unfortunately, the compiler doesn't really have obvious hotspots,
profiles show that no method reports more than 5% of time.
- Jason Zaugg
CPU basics:
The CPU executes in cycles; you code in instructions.
Cycles != Instructions
CPU: sequential

CPU: pipelined
Pipeline Depth in Haswell: 14 - 19, avg 16.
CPU: superscalar
Issue Width in Haswell: 8.
CPU: What could go wrong?
- branch & jump mispredictions (megamorphic dispatch);
- data dependencies;
- data is not available(cache misses).
CPU: Can the compiler help?
- branch & jump mispredictions (megamorphic dispatch);
- data dependencies;
- data is not available(cache misses).
CPU: data access time
| Data access | Size | Cycles |
|---|---|---|
| L1 cache reference | 32 KB | 4 |
| L2 cache reference | 256 KB | 12 |
| L3 cache reference | 6 MB | 22 |
| Main memory reference | 4+ GB | 120 |
| Disk seek | 100+ GB | 1,000,000+ |
How many μops have you missed by waiting for a single memory read?
120 x 8 x 16 = 15360
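The same estimate (stall cycles x issue width x pipeline depth) can be applied to every level of the hierarchy. The sketch below assumes the Haswell figures quoted earlier (issue width 8, average pipeline depth 16) and the cycle counts from the table; `StallCost` is a hypothetical helper name, and the result is a rough upper bound rather than a measurement.

```java
// Hypothetical helper reproducing the back-of-the-envelope estimate.
final class StallCost {
    static final int ISSUE_WIDTH = 8;     // Haswell issue width quoted in the talk
    static final int PIPELINE_DEPTH = 16; // average Haswell pipeline depth quoted in the talk

    // Rough upper bound on micro-ops that could have been in flight during a stall.
    static long missedMicroOps(int stallCycles) {
        return (long) stallCycles * ISSUE_WIDTH * PIPELINE_DEPTH;
    }

    public static void main(String[] args) {
        String[] levels = {"L1", "L2", "L3", "Main memory"};
        int[] cycles = {4, 12, 22, 120}; // cycle counts from the table
        for (int i = 0; i < levels.length; i++) {
            System.out.println(levels[i] + ": " + missedMicroOps(cycles[i]) + " uops");
        }
    }
}
```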
Funny details about caches:
- L1 is split: 32 KB for instructions and 32 KB for data;
- L1 & L2 are per-core;
- L3 is shared between all cores;
- L3 is inclusive.
Problem with L3 being inclusive
- Caches use a Least Recently Used (LRU) strategy for eviction;
- Accessing data in L1 does not update the usage counters in L3;
- Adding entries to L3 may evict actively used entries in L1.
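The three points above can be reproduced with a toy model. This is a sketch under strong assumptions: two fully associative levels with exact LRU, tiny capacities, and no L2. Real replacement policies are more elaborate. `InclusiveCacheModel` and its members are hypothetical names.

```java
import java.util.LinkedHashSet;

// Hypothetical toy model of an inclusive L3 sitting behind a small L1.
final class InclusiveCacheModel {
    private final int l1Capacity;
    private final int l3Capacity;
    // Iteration order is least recently used first.
    private final LinkedHashSet<String> l1 = new LinkedHashSet<>();
    private final LinkedHashSet<String> l3 = new LinkedHashSet<>();
    int backInvalidations = 0;

    InclusiveCacheModel(int l1Capacity, int l3Capacity) {
        this.l1Capacity = l1Capacity;
        this.l3Capacity = l3Capacity;
    }

    // Returns where the line was served from: "L1", "L3" or "MEM".
    String access(String line) {
        if (l1.remove(line)) {
            l1.add(line); // refresh recency in L1 only; L3 never sees this hit
            return "L1";
        }
        String servedFrom;
        if (l3.remove(line)) {
            l3.add(line); // refresh recency in L3
            servedFrom = "L3";
        } else {
            if (l3.size() == l3Capacity) {
                String victim = l3.iterator().next(); // LRU from L3's point of view
                l3.remove(victim);
                // Inclusive: a line evicted from L3 must also leave L1.
                if (l1.remove(victim)) backInvalidations++;
            }
            l3.add(line);
            servedFrom = "MEM";
        }
        if (l1.size() == l1Capacity) l1.remove(l1.iterator().next());
        l1.add(line);
        return servedFrom;
    }

    public static void main(String[] args) {
        InclusiveCacheModel m = new InclusiveCacheModel(2, 4);
        // "A" is hot: it is touched between every streaming access.
        for (String line : new String[] {"A", "B", "A", "C", "A", "D", "A", "E", "A"}) {
            System.out.println(line + " -> " + m.access(line));
        }
        System.out.println("back-invalidations: " + m.backInvalidations);
    }
}
```

In this trace "A" hits in L1 three times, yet L3 still considers it the oldest line. Bringing in "E" evicts "A" from both levels, so the next access to "A" goes all the way to memory.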
How does knowing this help us?
Macro Phases
- Every phase traverses a tree independently
- Reading cold data
- Creating new subtrees that live long and are promoted to the OldGen of the GC
Mini Phases
- Phases share tree traversal
- Trees you are accessing are hot
- Trees die fast and do not get promoted to OldGen of GC
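The difference between the two schemes can be sketched on a toy tree. This is not Dotty's implementation: `PhaseFusion`, `Tree`, `addOne` and `twice` are hypothetical, and the two schemes agree here only because each transformation looks at a single node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical toy model: separate traversals per phase vs. one shared traversal.
final class PhaseFusion {
    static final class Tree {
        final int value;
        final List<Tree> children;
        Tree(int value, List<Tree> children) {
            this.value = value;
            this.children = children;
        }
    }

    static int nodesVisited = 0;

    // One bottom-up traversal that applies all given node transforms at each node.
    static Tree traverse(Tree t, List<UnaryOperator<Tree>> transforms) {
        nodesVisited++;
        List<Tree> kids = new ArrayList<>();
        for (Tree child : t.children) kids.add(traverse(child, transforms));
        Tree result = new Tree(t.value, kids);
        for (UnaryOperator<Tree> f : transforms) result = f.apply(result);
        return result;
    }

    // Macro phases: every phase walks the whole tree on its own.
    static Tree macroPhases(Tree t, List<UnaryOperator<Tree>> phases) {
        for (UnaryOperator<Tree> phase : phases) {
            List<UnaryOperator<Tree>> single = new ArrayList<>();
            single.add(phase);
            t = traverse(t, single);
        }
        return t;
    }

    // Mini phases: all phases share one walk, so each node is transformed while hot.
    static Tree miniPhases(Tree t, List<UnaryOperator<Tree>> phases) {
        return traverse(t, phases);
    }

    static Tree addOne(Tree t) { return new Tree(t.value + 1, t.children); }
    static Tree twice(Tree t) { return new Tree(t.value * 2, t.children); }

    static List<UnaryOperator<Tree>> examplePhases() {
        List<UnaryOperator<Tree>> phases = new ArrayList<>();
        phases.add(PhaseFusion::addOne);
        phases.add(PhaseFusion::twice);
        return phases;
    }

    static Tree exampleTree() {
        List<Tree> leaves = new ArrayList<>();
        leaves.add(new Tree(2, new ArrayList<>()));
        leaves.add(new Tree(3, new ArrayList<>()));
        return new Tree(1, leaves);
    }

    static int sum(Tree t) {
        int total = t.value;
        for (Tree child : t.children) total += sum(child);
        return total;
    }

    public static void main(String[] args) {
        nodesVisited = 0;
        Tree a = macroPhases(exampleTree(), examplePhases());
        System.out.println("macro: sum=" + sum(a) + ", visits=" + nodesVisited);
        nodesVisited = 0;
        Tree b = miniPhases(exampleTree(), examplePhases());
        System.out.println("mini:  sum=" + sum(b) + ", visits=" + nodesVisited);
    }
}
```

Both variants produce the same tree, but the fused version visits each node once, and the intermediate node created by the first transform becomes garbage immediately instead of surviving until the next whole-tree pass.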
GC promotion to OldGen
Memory accesses (cache misses)
Performance
Other approaches to combine with?
Distribute!
See Iulian's talk!
Tools
- tracing & sampling profilers for initial exploration
- Linux perf - for measuring entire applications
- JMH with the perfasm profiler (-prof perfasm) - for microbenchmarks
Advice
- getting performance is hard, needs a lot of exploration;
- losing performance is easy, seemingly small changes can kill it;
- tracking performance is crucial.