Numerical Weather Prediction: Facing the Future with C++
T. Quintino, F. Rathgeber, W. Deconinck, B. Raoult, M. Fuentes (ECMWF), C++Now 2015, May 13, 2015
Overview
• ECMWF
• Numerical Weather Prediction
• Challenges Ahead
• Learning from the Past
• Preparing the Future
ECMWF
ECMWF is an independent intergovernmental organisation established in 1975, with 21 Member States and 13 Co-operating States.
Who are we and what do we do?
Medium-Range Weather Forecasts
We produce global weather forecasts up to 15 days ahead. Our products also include monthly and seasonal forecasts, and we collect and store meteorological data.
What do we have to achieve this?
• People: about 260 staff, specialists and contractors
• Equipment: state-of-the-art supercomputers and data handling systems
• Budget: £50 million per year
Reading, United Kingdom
Numerical Weather Prediction
Model
A basic description of our models
Data sources for the ECMWF Meteorological Operational System
Major assimilated datasets
We receive 300 million observations from 130 sources daily: surface stations, radiosonde balloons, aircraft, geostationary satellites (IR), polar satellites (infrared) and polar satellites (microwave).
4D-Var Assimilation
Data Assimilation
A short-range forecast is combined with measurements to produce a corrected representation of the atmosphere.
Model Levels
Current operational: 137 levels, up to an altitude of about 40,000 m.
Operational model grid
High Resolution (16 km), ENS (32 km)
Grid Effect
Orography, grid points and land-sea mask in TL639 (EPS 2010) and TL1279 (operational 2010), ECMWF model: orography shaded (height in m), land grid points (red), sea grid points (blue).
Meteorological Fields
Fields: 2M grid points × 137 levels = 287M values per variable.
Operational models produce:
– 13 million fields daily
– totalling 14 TB/day
Time-critical window: 1 h!
Chaos and weather prediction
The atmosphere is a chaotic system:
• Small errors can grow to have a major impact (butterfly effect)
• We can never perfectly measure the state of the atmosphere
• This limits weather prediction to a week or so ahead
Ensemble Forecasts
• A parallel set of forecasts from very slightly different initial conditions and model formulations
• Assess the uncertainty of today's forecast
Why run an ensemble prediction system?
Basic idea
• Taking account of uncertainty
• Forecasting forecast skill
Forecasting benefits
• Assess the uncertainty of today's forecast
• Provide alternative forecast scenarios
• Highlight the predictable (large-scale) component and the risk of a less likely but significant (small-scale) event
Continuing challenges
• Forecasting extreme events
• Extending the forecast range
MSLP 66-h forecasts for 16-Oct-1987: the analysis (AN 1987-10-16, 06 GMT), the control forecast and the 51 EPS members (+66 h), a multi-valued forecast.
Product Generation
77 million products disseminated every day, totalling 6 TB.
Output fields are interpolated onto user-required grids.
Product generation is also subject to a dissemination schedule...
Time-critical window: 1 h!
RMDCN Connections
Products are also served via web visualisation services (demo).
Challenges Ahead
• Scalable Algorithms
• Platform Uncertainty
• Big Data
New Algorithms Challenge
To produce better forecasts:
• Assimilate better & more observations
• Keep increasing model resolution
• Add modelling of new physical phenomena
• Add more ensemble members
More computations, all run within a critical 1 h time window.
Evolution of ECMWF scores, comparing the northern and southern hemispheres across resolution upgrades (~210 km, ~125 km, ~63 km, ~39 km, ~25 km, ~16 km).
Benefits of High Resolution
Hurricane Sandy, 28 Oct 2012: mean sea-level pressure and wave-height forecasts at T639, T1279 and T3999 (3- to 5-day lead times), compared against NEXRAD precipitation for 27 Oct; the analysis (AN) for 30 Oct is shown for comparison.
Planned Resolution Upgrades
9 km by end 2015
Archive size vs. supercomputer power
From the Cray-1A (Nov 1978) through the Cray X-MP and C90, the Fujitsu VPP700/VPP5000 and the IBM Power series up to the IBM-P6 (Jul 2009), HPC performance (GFLOPs) and archive size (TB) grow together across five orders of magnitude.
The archival rate has always been strongly correlated with computational power.
Resolution Upgrades

Resolution   Grid size   Grid points   Field size (in memory)
T319         62.5 km     204 k         1.6 MB
T511         39 km       524 k         4 MB
T799         25 km       1.2 M         9.6 MB
T1279        16 km       2.1 M         16.8 MB
T2047        10 km       8.4 M         67.2 MB
T3999        5 km        20 M          160 MB
T7999        2.5 km      80 M          640 MB
As memory per core diminishes (think GPUs and Intel Phi)... this may have serious implications for the data-processing software!
Power Challenge
→ 25% of HPC used for operations
→ 10% of operations for the HRES forecast
→ 2.5% of HPC
Time-critical limit: 240 forecast days per day
→ 2.5 km* forecast on a Cray XC-30 (84,000 cores) = 2 MW
→ 240 days/day requires 270,000 cores = 6 MW
→ × 10** = 60 MW?
* planned for 2025  ** scaling from HRES to full HPC
Platform Uncertainty Challenge
• How do we keep up with the changing hardware landscape?
• How do we make best use of the new heterogeneous systems?
  – CPUs
  – GPUs
  – Accelerators
• With very limited human resources...
Hardware Landscape
Proliferation of HPC architectures:
Intel Xeon
Intel Xeon Phi
IBM Power8
NEC SX-ACE Aurora
NVidia Tesla
ThunderX ARM
AMD FirePro
X-Gene ARM
Power Matters
"Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi", Abdurachmanov et al., 2014
Hypothetical Solution
• Need flexibility?
• Need performance?
• Code easier to maintain?
• A modern language?
We have a solution...
C++
But …
HPC Platform Restrictions
• POSIX
• Only supported languages:
  – Fortran
  – C99
  – recently C++ (but not C++11)
  – shell (ksh) + Perl (but not Python)
• Most of our scientists know:
  – Fortran
  – Python (maybe)
  – basic shell scripting
C++ Support on HPC
• Support for C++ on HPC platforms is... patchy!
• Some HPC compilers aren't fully C++98-conformant and choke on heavily templated code:
  – the Eigen library for linear algebra
  – parts of Boost (Serialization, Proto)
• Not to mention C++11!
Change on the way? Community & vendors are starting to talk about C++.
Scientists are reluctant to use C++
Possible explanations (my view):
• Fortran is what they know
• Fortran is very fast for numeric computations
• C++ was (initially) perceived as slow:
  – hidden constructors
  – misuse of classes
  – no restrict keyword
  Mostly corrected now (e.g. the work of T. Veldhuizen)
• But C++ is still perceived as...
  – complex (templates are a common complaint)
  – less portable (than C or Fortran)
Big Data Challenge
"Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
"3D Data Management: Controlling Data Volume, Velocity and Variety", D. Laney, Gartner, 2001
The 3 V's of Big Data
V is for Volume: Observations
Increase of satellite data usage
V is for Volume: Products
V is for Volume: Archive
(Note the deletion of 1 PB visible in the growth curve.)
V is for Velocity
• ECMWF's archive grows exponentially: V(t) = V0 · e^(r·t), where V(t) is the volume of the archive at time t, V0 the initial volume and r the rate of growth.
  – r is around 0.5, which is a 50% increase per year
  – The rate at which data is added also grows exponentially, at the same rate: dV/dt = r · V0 · e^(r·t)
• In 1995, the size of the archive was increasing at a rate of 14 TB/year.
• In 2015, the size of the archive increases at a rate of 100 TB/day.
V is for Variety
Archive content from 1985 to 2010, across machines from the X-MP/4 to the IBM-P6 and model cycles from T106L16 to T1279L91: each resolution upgrade adds data, but so does every new product type (EPS, 4D-Var, waves, ensemble data assimilation, monthly forecasts, ...).
1/3 of the growth is resolution increase; 2/3 is the increase in product types.
None of this is new
We have always dealt with Big Data... What changed?
CPU Power Growth
"The Free Lunch Is Over", H. Sutter, Dr. Dobb's Journal, 30(3), March 2005
But what about "real" performance?
CPU Performance Growth (single-threaded)
"A Look Back at Single-Threaded CPU Performance", J. Preshing, Feb 2012
More registers, vector units, branch prediction... but performance is also harder to achieve! Deeper memory hierarchies, out-of-order execution...
Storage Density Growth, Multiple Technologies
"Tape based magnetic recording: technology landscape comparisons with hard disk drive and flash roadmaps", R. Fontana et al., IBM Research Division, 2011
HDD Storage Growth
"GPFS Scans 10 Billion Files in 43 Minutes", R. Freitas et al., IBM Research Division, 2011
Volume is linearly proportional to areal density... but the transaction rate hasn't kept up! It recently follows a 25-40% CAGR...
This means we may have the capacity, but perhaps not the bandwidth...
What does it imply?
• "No free lunch" ⇒ improve our software
  – NWP models have been parallel and concurrent for many decades
• Optimize data movement
• Develop new algorithms that exploit...
  – (more) concurrent computation
  – data locality
  – computational intensity (CPU usage / MB transferred)
• Software must cope with change
  – flexibility
  – best use of new hardware
  – an unknown future for parallel platforms
Learning from the Past: MARS
B. Raoult, M. Fuentes
Meteorological Archival and Retrieval System
A managed archive, not a file system:
• Users are not aware of data location
• Access is expressed in meteorological terms
Data is kept forever:
• As data accumulates, datasets become more useful
• Deleting old data in an exponentially growing archive is meaningless
Consists of 3 layers:
• Cache in the HPC (~80% hit ratio)
• Servers' HDD cache (~80% hit ratio)
• HPSS tape system
Meteorological Archival and Retrieval System
Fully distributed:
– 15 servers for metadata and data movers
– 80 PB primary archive
– 1.5 PB of disk cache (2.5%)
– 110 billion fields in 8.5 million files (2014)
– 200 million objects (2014)
– 100 TB added daily
– 100 TB retrieved per day
– 2 million requests per day
• Fully written in C++
• In operation since 1995
• Index is an object-oriented DB
• Persisted in the filesystem
MARS 2011 Migration
• From monolithic to distributed
• From AIX to Linux
• From big endian to little endian
• Metadata indexing of 40 PB
• A job 2 years in the planning
• Degraded service for a couple of hours
• No one noticed...
A meteorological language

    retrieve,
      date      = 20110101/to/20110131,
      parameter = temperature/geopotential,
      type      = forecast,
      step      = 12/to/240/by/12,
      levtype   = pressure levels,
      levels    = 1000/850/500/200,
      grid      = 2/2,
      area      = -10/20/10/0

• This request:
  – represents 31 × 2 × 20 × 4 = 4960 fields
  – defines an interpolation followed by cropping
Indirection is the key to scalability
Preparing the Future: Atlas & Hermes
Aims
• Big data challenge (scalability)
• Adaptable solution (flexibility)
• Code evolution (maintainability)
Let's use C++ (98)
Atlas
W. Deconinck, T. Quintino
Challenges: New Algorithms & Platform Uncertainty
Current IFS Model
• 2M lines of Fortran code
• 9000 routines / modules
• Hybrid MPI + OpenMP
• Fixed data structure
• Well organized, but...
• ...lots of global data
• Impractical to rewrite
• Fortunately most global operations are located in the "dynamical core"
Atlas
• Framework for parallel, dynamic data structures
• Supporting multiple types of grids
• Fully written in C++
• Providing Fortran 2003 interfaces
• Basis on which to develop a scalable dynamical core
Atlas capabilities
So what about Fortran?
• Keep Fortran for the numeric algorithms
• C++ for...
  – all data & memory management
  – message passing
• For every C++ class, a Fortran derived type
• Use the Fortran ISO C bindings (F2003)
C++ contains the implementation

    // C++ implementation
    class Field {
    public:
        explicit Field(int size) { data_.resize(size); }
        double* data() { return &data_[0]; }
        int size() const { return data_.size(); }
    private:
        std::vector<double> data_;
    };

    // C interface – to be called from Fortran
    extern "C" {

    Field* atlas__new_Field(int size) { return new Field(size); }

    void atlas__Field__access_data(Field* self, double*& data, int& size) {
        data = self->data();
        size = self->size();
    }

    }
WARNING: Fortran code ahead!
Fortran/C interface

    ! Fortran derived type: contains no data, only methods (F2003)
    TYPE :: Field_type
      PRIVATE
      TYPE(C_PTR) :: object = C_NULL_PTR
    CONTAINS
      PROCEDURE, PUBLIC :: access_data => Field__access_data
    END TYPE

    ! Method calls the C function, which uses the C++ implementation
    SUBROUTINE Field__access_data(this, data)
      CLASS(Field_type), INTENT(in) :: this
      TYPE(C_PTR)                   :: cptr_data
      REAL(JPRB), POINTER           :: data(:)
      INTEGER                       :: size
      ! This is the C function
      CALL atlas__Field__access_data(this%object, cptr_data, size)
      ! Convert the C pointer to a Fortran pointer to an array of doubles
      CALL C_F_POINTER(cptr_data, data, (/size/))
    END SUBROUTINE
Fortran Program

    PROGRAM main
      TYPE(Field_type)    :: field
      REAL(JPRB), POINTER :: field_data(:)
      INTEGER             :: j
      ! We can ask the field for access to its data
      CALL field%access_data(field_data)
      ! Now we can use the data as any other Fortran array
      DO j = 1, SIZE(field_data)
        field_data(j) = 0.
      ENDDO
    END PROGRAM

• Furthermore, we provide another, higher-level API to hide the Fortran 2003.
Hermes
T. Quintino, B. Raoult, F. Rathgeber
Challenges: Big Data & Platform Uncertainty
A look at the data chain
Observations (Obs1, Obs2, Obs3, Obs4) flow through processing pipelines into the IFS model.
A look at the data chain
The IFS model emits raw model output, which product generation (PG1 to PG4) turns into products for dissemination.
Observations & Fields: Some Similarities
• Both are processed in pipelines
• Both are parallel...
  – Concurrent (threads / OpenMP)
  – Distributed (MPI)
How does it differ from an atmospheric model?
• Parallel, but with little or no synchronisation
• No time evolution
• (Nearly) "embarrassingly parallel"
Possible: automatic parallelisation
Requirements
• A way to define computations
• Efficient parallel computation
• A way to accept data and deliver results
• Flexibility for future architectures
• Extendable and understandable by scientists
Solution
Indirection is key to scalability:
decouple what to compute from how and where to compute it.
Use Case: Product Generation
User requests on a frontend generate tasks, e.g. map(interpolate, list(f1, f2, ...)), which a dispatcher distributes to backends on the supercomputer, a Linux cluster or a workstation: the frontend defines, the backend computes.
Use Case: Observation Filters
Filters are composed, e.g. map(f, list(...)), map(g•f, list(...)) or map(h•g•f, list(...)), and the dispatcher places each composed pipeline on the supercomputer, a Linux cluster or a workstation: the frontend defines, the backend computes.
Use Case: Workstation Interpolation & Visualisation
map(interpolate, list(...)) dispatches T8000 fields (N × 670 MB) to an interpolation cluster, which returns N640 fields (N × 10 MB) to the workstation: the frontend defines, the backend computes.
Hermes
• DSL to define computations
• Frontend for sending requests and data
• Broker to dispatch work to processing units
• Multiple computational backends:
  – CPU
  – CUDA (linear algebra)
  – Intel MKL (linear algebra)
  – Intel TBB??
  – HPX??
Dispatcher
Producer threads push tasks onto a MessageQueue; a Picker picks tasks off the queue and a Handler hands them to consumer threads (Thread 1 ... Thread M working through Task 1 ... Task N).
Expression Engine
• Prototype DSEL embedded in C++
• Plug-ins in Fortran (not yet)
• Computations as expression trees
• Hierarchy of expressions:
  – terminals (value semantics)
  – functions (pure operations)
Expression Engine
• Inspired by the functional language Haskell
• Expressions are immutable
• Functions are pure
• No variables, only constants
• Lazy evaluation
• Optimiser: finds the most efficient path to evaluate an expression
  – Pattern matching (e.g. linear combination)
  – Collapsing an expression tree into a terminal
  – Detecting common sub-expressions
  – Automatic parallelisation
Expression

    typedef std::shared_ptr<Expression> ExpPtr;
    typedef std::vector<ExpPtr> args_t;

    class Expression {
    public:
        ...
        Expression(const args_t& args);
        size_t arity() const { return args_.size(); }
        ...
        ExpPtr self();
        ExpPtr eval() const { Scope s; return ExpPtr(optimise()->evaluate(s)); }
        ExpPtr eval(Scope& s) const { return ExpPtr(optimise()->evaluate(s)); }
        ...
        // used to bind lambda parameters
        virtual ExpPtr resolve(Scope&) const { return self(); }
        virtual ExpPtr optimise(size_t depth) const { return self(); }
        template <class T> std::shared_ptr<T> as() { ... }
    private:
        virtual ExpPtr evaluate(Scope&) const = 0;
        virtual ExpPtr cloneWith(args_t&) const = 0;
        virtual std::string signature() const = 0;
    };
Values

    class Value : public Expression {
    protected:
        Value(const args_t& args) : Expression(args) {}
    private:
        virtual ExpPtr evaluate(Scope&) const { return self(); }
    };

    class Real : public Value {
    public:
        typedef real_t value_t;
        ...
        Real(ExpPtr e);
        Real(const real_t& v);
        ...
        value_t value() const { return v_; }
        static value_t extract(const ExpPtr& e) { return e->as<Real>()->value(); }
        ...
    private:
        virtual std::string signature() const { return "r"; }
    };

Bool, Integer, Real, String, List, Vector, Matrix
Functions

    class Function : public Expression {
    public:
        typedef ExpPtr (*func_t)(Scope&, const args_t&);
        typedef std::map<key_t, func_t> dispatcher_t;
        virtual ExpPtr optimise(size_t depth) const;
        virtual std::string signature() const { return signatureArguments(args()); }
    protected:
        Function(const args_t& args) : Expression(args) {}
        // returns something like "func(r,v,r,v)"
        std::string signatureArguments(const args_t& args) const { ... }
    private:
        virtual ExpPtr evaluate(Scope&) const;
    };
Functions (2)

    ExpPtr Function::evaluate(Scope& ctx) const {
        // create temporary args
        args_t tmp = args();
        for (size_t i = 0; i < tmp.size(); ++i) {
            tmp[i] = args(i, ctx, true)->self();   // evaluate the arguments
        }
        std::string sig = signatureArguments(tmp); // new signature
        dispatcher_t& d = dispatcher();            // find the dispatching function
        dispatcher_t::iterator itr = d.find(sig);
        if (itr == d.end()) { ... fail ... }
        return ((*itr).second)(ctx, tmp);          // dispatch
    }

    ExpPtr Function::optimise(size_t depth) const {
        // revert to the generic optimiser if the Expression doesn't optimise itself
        return Optimiser::apply(self(), depth);
    }
Functions (3)

    class Map : public Function {
    public:
        Map(ExpPtr func, ExpPtr lst);
    private:
        ...
        ExpPtr evaluate(Scope& ctx) const {
            ExpPtr f  = args(0, ctx, false); // get the function, don't evaluate
            ExpPtr ls = args(1, ctx, true);  // get the list and evaluate if necessary
            const List::value_t& list = List::extract(ctx, ls);
            List::value_t res;
            res.reserve(list.size());
            for (size_t i = 0; i < list.size(); ++i) {
                ExpPtr e = list[i]->resolve(ctx)->eval(ctx); // evaluate each element
                ExpPtr v = f->eval(e);
                res.push_back(v);
            }
            return ExpPtr(new List(res));
        }
        ...
    };
Functions (4)
Map, Reduce, Merge, Take, ZipWith, IfElse, Call, Lambda, Filter, Bind
UnaryOperators: neg, sqrt, abs
BinaryOperators: mul, add, sub, div, mod, min, max
Predicates
Examples

    auto a = real(2.);
    auto b = real(4.);
    auto x = vector({1., 2., 3.});
    auto y = vector(3, 5.);

    auto c = add(a, b);
    std::cout << *c << std::endl; // Add(Real(2), Real(4))

    auto e = add(mul(c, x), mul(b, y));
    c = c->eval();
    std::cout << *c << std::endl; // Real(6)

    // Add(Mul(Add(Real(2), Real(4)), Vector(1, 2, 3)), Mul(Real(4), Vector(5, 5, 5)))
    std::cout << *e << std::endl;

    auto opt = e->optimise(true);
    std::cout << opt->signature() << std::endl; // optimised to "Linear(r,v,r,v)"

    auto f1 = reduce(mul(), list(a, x, x))->eval(); // Vector(2, 8, 18)
Examples

class Xpr provides value semantics and convenience operators

    // lambdas
    Xpr twice = call(lambda("x", Xpr(2.0) * Xpr("x"))); // define 2*x
    Xpr l = list(a, b, c);
    Xpr r = map(twice, l);
    std::cout << r() << std::endl; // List(Real(4), Real(8), Real(12))

    // currying
    Xpr F = lambda("i", call(lambda("j", Xpr("i") * Xpr("j")))); // define i*j
    Xpr H = call(F, Xpr(2.)); // curry to i*2
    Xpr G = H(3.);            // curry to 3*2
    std::cout << G() << std::endl; // Real(6)
Examples

    // Define the Y-combinator
    Xpr Y = lambda("f", call(Xpr("f"), Xpr("f")));

    // lookup table for the first elements of the series
    Xpr table = list(Xpr(0.), Xpr(1.), Xpr(1.), Xpr(2.), Xpr(3.), Xpr(5.),
                     Xpr(8.), Xpr(13.), Xpr(21.), Xpr(34.), Xpr(55.));

    // non-recursive fibonacci function
    Xpr fib = lambda("f", call(lambda("n",
        ifelse(Xpr("n") > Xpr(10.),
               Xpr(call(Xpr("f"), Xpr("f"), Xpr("n") - Xpr(1.0)))
             + Xpr(call(Xpr("f"), Xpr("f"), Xpr("n") - Xpr(2.0))),
               take(Xpr("n"), table)))));

    Xpr fibonacci = call(Y, fib); // applying the Y-combinator
    std::cout << fibonacci(20.) << std::endl; // Real(6765)
Arithmetic Expression
5 * 7 + 15 == add(prod(5, 7), 15)
The optimiser folds prod(5, 7) into the terminal 35, then sum(35, 15) into 50: constant folding and propagation.
Filter Expression
filter(greater(?, 2), list(-2, 4, 1, 5)) evaluates to list(4, 5): expression elimination.
Future work: Map fusion
map(h, map(g, map(f, list(...)))) can be fused into a single map(h•g•f, list(...)), traversing the list only once.
Future work: distributed interpolation of fields
write(average(map(interpolate, list(field1, field2, ..., field6)))): the map distributes the interpolation over the six fields.
Future work: distributed interpolation of fields
The same expression, automatically parallelised: the list is split across hosts a, b and c, each interpolating its fields on two threads, and the partial results are combined into the average.
Linear Algebra Expressions
norm(sub(vector, vector)): the norm of the difference of two vectors evaluates to a real terminal.
Dispatch to Compute Engines
Expression signatures are dispatched to a matching compute engine:
  Dot(v,v): cublasDdot (GPU engine), cblas_ddot (MKL engine), ddot (generic BLAS/LAPACK)
  Dot(m,v): cublasDgemv (GPU engine), cblas_dgemv (MKL engine), dgemv (generic BLAS/LAPACK)
  Dot(m,m): cublasDgemm (GPU engine), cblas_dgemm (MKL engine), dgemm (generic BLAS/LAPACK)
Expressions: client code

    const size_t ncols = 1024;
    const size_t nrows = 1024;
    auto init = [](size_t, size_t j = 0) { return (real_t)std::rand() / RAND_MAX; };

    XprClient c;

    // L2 norm of the difference of 2 vectors
    c.send( norm( sub( vector(ncols, init), vector(ncols, init) ) ) );
    // Dot product of vectors
    c.send( dot( vector(ncols, init), vector(ncols, init) ) );
    // Dot product of matrix and vector
    c.send( dot( matrix(nrows, ncols, init), vector(ncols, init) ) );
    // Dot product of matrix and matrix
    c.send( dot( matrix(nrows, ncols, init), matrix(ncols, nrows, init) ) );
Generic Engine implementation

    ExpPtr compute_dot_generic(Scope&, const args_t& p) {
        Vector::value_t v1 = Vector::extract(p[0]);
        Vector::value_t v2 = Vector::extract(p[1]);
        ASSERT( v1.size() == v2.size() );
        Vector::elemt_t r = 0;
        for (size_t i = 0; i < v1.size(); ++i)
            r += v1[i] * v2[i];
        return real(r);
    }

    ExpPtr compute_gemv_generic(Scope&, const args_t& p);
    ExpPtr compute_gemm_generic(Scope&, const args_t& p);

    // Register the generic functions in the dispatch table
    struct DotRegister {
        DotRegister() {
            Function::dispatcher()["xpr::Dot(v,v)"] = &compute_dot_generic;
            Function::dispatcher()["xpr::Dot(m,v)"] = &compute_gemv_generic;
            Function::dispatcher()["xpr::Dot(m,m)"] = &compute_gemm_generic;
        }
    };
    static DotRegister dotRegister;
MKL Engine implementation

    ExpPtr compute_dot_MKL(Scope&, const args_t& p) {
        const Vector::value_t& v1 = Vector::extract(p[0]);
        const Vector::value_t& v2 = Vector::extract(p[1]);
        ASSERT( v1.size() == v2.size() );
        return real( cblas_ddot( v1.size(), v1.data(), 1, v2.data(), 1 ) );
    }

    ExpPtr compute_gemv_MKL(Scope&, const args_t& p);
    ExpPtr compute_gemm_MKL(Scope&, const args_t& p);

    // Register the MKL functions in the dispatch table
    void registerMKL() {
        Function::dispatcher()["xpr::Dot(v,v)"] = &compute_dot_MKL;
        Function::dispatcher()["xpr::Dot(m,v)"] = &compute_gemv_MKL;
        Function::dispatcher()["xpr::Dot(m,m)"] = &compute_gemm_MKL;
    }
GPU Engine implementation

    ExpPtr compute_dot_CUDA(Scope&, const args_t& p) {
        const Vector::value_t& x = Vector::extract(p[0]);
        const Vector::value_t& y = Vector::extract(p[1]);
        ASSERT( x.size() == y.size() );
        const size_t size = x.size() * sizeof(real_t);
        Vector::elemt_t r;
        real_t* d_x; ///< device memory for vector x
        real_t* d_y; ///< device memory for vector y
        cublasHandle_t handle;
        CALL_CUDA( cudaMalloc((void**)&d_x, size) );
        CALL_CUDA( cudaMalloc((void**)&d_y, size) );
        CALL_CUDA( cudaMemcpy(d_x, x.data(), size, cudaMemcpyHostToDevice) );
        CALL_CUDA( cudaMemcpy(d_y, y.data(), size, cudaMemcpyHostToDevice) );
        CALL_CUBLAS( cublasCreate(&handle) );
        CALL_CUBLAS( cublasDdot(handle, x.size(), d_x, 1, d_y, 1, &r) );
        CALL_CUBLAS( cublasDestroy(handle) );
        CALL_CUDA( cudaFree(d_x) );
        CALL_CUDA( cudaFree(d_y) );
        return real( r );
    }
Live Demo
Questions?