Guest lecture for Compiler Construction, Spring 2015
Verified compilers
Magnus Myréen
Chalmers University of Technology
Mentions joint work with Ramana Kumar, Michael Norrish, Scott Owens and many more
Verified compilers
(Sometimes called certified compilers, but that’s misleading…)

What?
CompCert 2005 –
Program verification
For safety-critical software, formal verification of program correctness may be worth the cost.
Such verification is typically done on the source program. So what if the compiler is buggy?
Use a certified compiler!
CompCert is a compiler for a large subset of C, with PowerPC assembler as target language.
Written in Coq, a proof assistant for formal proofs.
Comes with a machine-checked proof that, for any program that does not generate a compilation error, the source and target programs behave identically. (Precise statement needs more details.)
Trusting the compiler
Bugs
When finding a bug, we go to great lengths to find it in our own code.
Most programmers trust the compiler to generate correct code.
The most important task of the compiler is to generate correct code.
Maybe it is worth the cost? Cost reduction?

Testing compilers
Establishing Compiler Correctness
Alternatives
Proving the correctness of a compiler is prohibitively expensive (however, see the CompCert project).
Testing is the only viable option
… but with testing you never know you caught all bugs!
All (unverified) compilers have bugs
“Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input.” (PLDI’11)
Finding and Understanding Bugs in C Compilers
Xuejun Yang, Yang Chen, Eric Eide, John Regehr
University of Utah, School of Computing
{jxyang, chenyang, eeide, regehr}@cs.utah.edu

“[The verified part of] CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task.”

Abstract (excerpt): Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding …

    int foo (void) {
      signed char x = 1;
      unsigned char y = 255;
      return x > y;
    }

Figure 1. We found a bug in the version of GCC that shipped with Ubuntu Linux 8.04.1 for x86. At all optimization levels it compiles this function to return 1; the correct result is 0. The Ubuntu compiler was heavily patched; the base version of GCC did not have this bug.
This lecture: Verified compilers
What? Proof that compiler produces good code.
Why? To avoid bugs, to avoid testing.
How? By mathematical proof… (rest of this lecture)
Proving a compiler correct
Ingredients:
• a formal logic for the proofs (like first-order logic, or higher-order logic)
• accurate models of
  • the source language
  • the target language
  • the compiler algorithm
  Proofs are only about things that live within the logic, i.e. we need to represent the relevant artefacts in the logic: a lot of details… (to get wrong)
Tools:
• a proof assistant (software)
  … necessary to use a mechanised proof assistant (think, ‘Eclipse for logic’) to avoid mistakes and missing details
Accurate model of prog. language
Model of programs:
• syntax: what it looks like
• semantics: how it behaves (e.g. an interpreter for the syntax)
Major styles of (operational, relational) semantics:
• big-step (this style for structured source semantics)
• small-step (this style for unstructured target semantics)
… next slides provide examples.
Syntax
Source:
  exp = Num num | Var name | Plus exp exp
Target ‘machine code’:
  inst = Const name num | Move name name | Add name name name
Target program consists of list of inst
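The two syntaxes above could be written down as Python data types roughly as follows. This is only a sketch: it assumes names are natural numbers (the compiler slide later computes n+1 on a name) and that num is a Python int. The field names (value, name, left, right, dst, src, src1, src2) are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Union

# Source expressions: exp = Num num | Var name | Plus exp exp
@dataclass(frozen=True)
class Num:
    value: int

@dataclass(frozen=True)
class Var:
    name: int  # assumption: names are natural numbers

@dataclass(frozen=True)
class Plus:
    left: "Exp"
    right: "Exp"

Exp = Union[Num, Var, Plus]

# Target instructions: inst = Const name num | Move name name | Add name name name
@dataclass(frozen=True)
class Const:
    dst: int
    value: int

@dataclass(frozen=True)
class Move:
    dst: int
    src: int

@dataclass(frozen=True)
class Add:
    dst: int
    src1: int
    src2: int

Inst = Union[Const, Move, Add]
Program = List[Inst]  # a target program is a list of instructions

# Example values: the source expression "1 + variable 0" and a small target program.
example_exp: Exp = Plus(Num(1), Var(0))
example_prog: Program = [Const(5, 1), Move(6, 0), Add(5, 5, 6)]
```

Frozen dataclasses are used here so that values are immutable and compare structurally, which is close to how datatypes behave in the logic.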
Source semantics (big-step)
Big-step semantics as relation ↓ defined by rules, e.g.

  ----------------
  (Num n, env) ↓ n

  lookup s in env finds v
  -----------------------
  (Var s, env) ↓ v

  (x1, env) ↓ v1    (x2, env) ↓ v2
  --------------------------------
  (Plus x1 x2, env) ↓ v1 + v2

called “big-step”: each step ↓ describes complete evaluation
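Because the big-step relation for this tiny language is deterministic, it can be read as a recursive interpreter with one case per rule. The sketch below assumes the environment is a Python dict from names (ints) to values, and it uses the Plus constructor from the syntax slide. The function name evaluate and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union

@dataclass(frozen=True)
class Num:
    value: int

@dataclass(frozen=True)
class Var:
    name: int

@dataclass(frozen=True)
class Plus:
    left: "Exp"
    right: "Exp"

Exp = Union[Num, Var, Plus]

def evaluate(x: Exp, env: Dict[int, int]) -> int:
    """Return v such that (x, env) ↓ v; raises KeyError if a variable is unbound."""
    if isinstance(x, Num):
        # Rule for Num: a literal evaluates to itself.
        return x.value
    if isinstance(x, Var):
        # Rule for Var: the premise "lookup s in env finds v".
        return env[x.name]
    if isinstance(x, Plus):
        # Rule for Plus: evaluate both subexpressions completely, then add.
        return evaluate(x.left, env) + evaluate(x.right, env)
    raise TypeError(f"not an expression: {x!r}")

# Usage: 1 + (variable 0), with variable 0 bound to 41.
result = evaluate(Plus(Num(1), Var(0)), {0: 41})
```

Each recursive call returns a final value, never an intermediate state, which is what makes this style "big-step".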
Target semantics (small-step)
“small-step”: transitions describe parts of executions
We model the state as a mapping from names to values here.
  step (Const s n) state = state[s ↦ n]
  step (Move s1 s2) state = state[s1 ↦ state s2]
  step (Add s1 s2 s3) state = state[s1 ↦ state s2 + state s3]
  steps [] state = state
  steps (x::xs) state = steps xs (step x state)
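The step and steps functions translate almost line for line into Python. In this sketch the state is assumed to be a dict from names (ints) to values, and the update state[s ↦ n] is modelled by building a new dict, so the old state is not mutated. Unlike the total function in the logic, reading a name that was never written raises KeyError here. The field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass(frozen=True)
class Const:
    dst: int
    value: int

@dataclass(frozen=True)
class Move:
    dst: int
    src: int

@dataclass(frozen=True)
class Add:
    dst: int
    src1: int
    src2: int

Inst = Union[Const, Move, Add]
State = Dict[int, int]

def step(inst: Inst, state: State) -> State:
    """Execute one instruction and return the updated state (a fresh dict)."""
    if isinstance(inst, Const):
        return {**state, inst.dst: inst.value}
    if isinstance(inst, Move):
        return {**state, inst.dst: state[inst.src]}
    if isinstance(inst, Add):
        return {**state, inst.dst: state[inst.src1] + state[inst.src2]}
    raise TypeError(f"not an instruction: {inst!r}")

def steps(prog: List[Inst], state: State) -> State:
    """Execute a whole program, one small step at a time."""
    for inst in prog:
        state = step(inst, state)
    return state

# Usage: put 2 in name 0, copy it to name 1, then add both into name 2.
final = steps([Const(0, 2), Move(1, 0), Add(2, 0, 1)], {})
```

The loop in steps is the iterative form of the recursive definition on the slide; both apply step to the instructions from left to right.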
Compiler function
Generated code stores the result in the register name (n) given to the compiler.
  compile (Num k) n = [Const n k]
  compile (Var v) n = [Move n v]
  compile (Plus x1 x2) n = compile x1 n ++ compile x2 (n+1) ++ [Add n n (n+1)]
Relies on variable names in source to match variable names in target.
Uses names above n as temporaries.
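A direct Python transcription of the compile function might look like this. It is a sketch under the same assumptions as the slide: names are natural numbers, and the caller picks n above every variable name used in the expression. The function is called compile_exp here only to avoid shadowing Python's built-in compile.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class Num:
    value: int

@dataclass(frozen=True)
class Var:
    name: int

@dataclass(frozen=True)
class Plus:
    left: "Exp"
    right: "Exp"

Exp = Union[Num, Var, Plus]

@dataclass(frozen=True)
class Const:
    dst: int
    value: int

@dataclass(frozen=True)
class Move:
    dst: int
    src: int

@dataclass(frozen=True)
class Add:
    dst: int
    src1: int
    src2: int

Inst = Union[Const, Move, Add]

def compile_exp(x: Exp, n: int) -> List[Inst]:
    """Generate code that leaves the value of x in name n, using names >= n only."""
    if isinstance(x, Num):
        return [Const(n, x.value)]
    if isinstance(x, Var):
        return [Move(n, x.name)]
    if isinstance(x, Plus):
        # Left operand lands in n, right operand in the temporary n+1, sum in n.
        return compile_exp(x.left, n) + compile_exp(x.right, n + 1) + [Add(n, n, n + 1)]
    raise TypeError(f"not an expression: {x!r}")

# Usage: compile "1 + variable 0" with target name 5.
code = compile_exp(Plus(Num(1), Var(0)), 5)
```

Compiling the right operand at n+1 is what keeps the left operand's result in n from being overwritten before the final Add.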
Correctness statement
Proved using proof assistant (demo!)

For every evaluation in the source …

  ∀x env res.
    (x, env) ↓ res ⟹

… for target state and k, such that k is greater than all var names and state is in sync with the source env …

    ∀state k.
      (∀i v. (lookup env i = SOME v) ⟹ (state i = v) ∧ i < k) ⟹

… in that case, the result res will be stored at location k in the target state after execution, and the lower part of the state is left untouched.

      (let state' = steps (compile x k) state in
         (state' k = res) ∧ ∀i. i < k ⟹ (state' i = state i))
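Before attempting a proof, it can help to run a statement like this on random inputs to catch mistakes in the statement itself. The sketch below does that in Python. It is a test, not a proof, and it assumes names are natural numbers and that the state starts as a copy of the environment. The helpers random_exp and check_once are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Num:
    value: int

@dataclass(frozen=True)
class Var:
    name: int

@dataclass(frozen=True)
class Plus:
    left: object
    right: object

@dataclass(frozen=True)
class Const:
    dst: int
    value: int

@dataclass(frozen=True)
class Move:
    dst: int
    src: int

@dataclass(frozen=True)
class Add:
    dst: int
    src1: int
    src2: int

def evaluate(x, env):
    """Big-step source semantics: return res with (x, env) ↓ res."""
    if isinstance(x, Num):
        return x.value
    if isinstance(x, Var):
        return env[x.name]
    return evaluate(x.left, env) + evaluate(x.right, env)

def step(inst, state):
    """Small-step target semantics for a single instruction."""
    if isinstance(inst, Const):
        return {**state, inst.dst: inst.value}
    if isinstance(inst, Move):
        return {**state, inst.dst: state[inst.src]}
    return {**state, inst.dst: state[inst.src1] + state[inst.src2]}

def steps(prog, state):
    for inst in prog:
        state = step(inst, state)
    return state

def compile_exp(x, n):
    if isinstance(x, Num):
        return [Const(n, x.value)]
    if isinstance(x, Var):
        return [Move(n, x.name)]
    return compile_exp(x.left, n) + compile_exp(x.right, n + 1) + [Add(n, n, n + 1)]

def random_exp(rng, depth, num_vars):
    """Hypothetical generator: a random expression over variables 0..num_vars-1."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return Num(rng.randint(0, 100))
        return Var(rng.randrange(num_vars))
    return Plus(random_exp(rng, depth - 1, num_vars),
                random_exp(rng, depth - 1, num_vars))

def check_once(rng):
    """Check one random instance of the correctness statement."""
    k = rng.randint(1, 5)                              # every variable name is below k
    env = {i: rng.randint(0, 100) for i in range(k)}
    x = random_exp(rng, 4, k)
    res = evaluate(x, env)                             # (x, env) ↓ res
    state = dict(env)                                  # state in sync with env
    final = steps(compile_exp(x, k), state)            # state'
    result_ok = final[k] == res                        # state' k = res
    lower_ok = all(final[i] == state[i] for i in range(k))  # names below k untouched
    return result_ok and lower_ok

rng = random.Random(0)
all_passed = all(check_once(rng) for _ in range(1000))
```

Passing such a check only says that no counterexample was found among the sampled cases; the proof in the proof assistant covers every x, env, state and k.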
A real language
Well, that example was simple enough…
But: Some people say: A programming language isn’t real until it has a self-hosting compiler
Bootstrapping for verified compilers? Yes!
Scaling up…
POPL 2014
CakeML: A Verified Implementation of ML
Ramana Kumar (1), Magnus O. Myreen (1), Michael Norrish (2), Scott Owens (3)
1 Computer Laboratory, University of Cambridge, UK
2 Canberra Research Lab, NICTA, Australia
3 School of Computing, University of Kent, UK
Abstract (excerpt): We have developed and mechanically verified an ML system called CakeML, which supports a substantial subset of Standard ML. CakeML is implemented as an interactive read-eval-print loop (REPL) in x86-64 machine code. Our correctness theorem ensures that this REPL implementation prints only those results permitted by the semantics of CakeML. Our verification effort touches on a breadth of topics including lexing, parsing, type checking, incremental and dynamic compilation, garbage collection, arbitrary-precision arithmetic, and compiler bootstrapping. …
First bootstrapping of a formally verified compiler.
Dimensions of Compiler Verification
How far compiler goes: source code → abstract syntax → intermediate language → bytecode → machine code
The thing that is verified: compiler algorithm → implementation in ML → implementation in machine code → machine code as part of a larger system
Our verification covers the full spectrum of both dimensions.
Idea behind in-logic bootstrapping
input: verified compiler function
Trustworthy code generation:
  functions in HOL (shallow embedding)
    ↓ proof-producing translation [ICFP’12, JFP’14]
  CakeML program (deep embedding)
    ↓ verified compilation of CakeML [POPL’14]
  x86-64 machine code (deep embedding)
output: verified implementation of compiler function
CakeML at a glance
The CakeML language = Standard ML without I/O or functors (a strict impure functional language)
i.e. with almost everything else:
✓ higher-order functions
✓ mutual recursion and polymorphism
✓ datatypes and (nested) pattern matching
✓ references and (user-defined) exceptions
✓ modules, signatures, abstract types
The verified machine-code implementation (parsing, type inference, compilation, garbage collection, bignums etc.) implements a read-eval-print loop (see demo).
The CakeML compiler verification
How? Mostly standard verification techniques as presented in this lecture, but scaled up to large examples. (Four people, two years.)
Compiler: string → tokens → AST → IL → bytecode → x86
New optimising compiler: IL-1 → IL-2 → … → IL-N → ASM → ARM, x86-64, MIPS-64
… work in progress (want to join? [email protected])
Compiler verification summary
Ingredients:
• a formal logic for the proofs
• accurate models of
  • the source language
  • the target language
  • the compiler algorithm
Tools:
• a proof assistant (software)
Method:
• (interactively) prove a simulation relation
Questions? Interested?