Numerical  Weather  Predic0on   Facing  the  Future  with  C++  

T. Quintino, F. Rathgeber, W. Deconinck, B. Raoult, M. Fuentes ECMWF C++Now 2015, May 13 2015 13/05/15

1

Overview   •  •  •  •  • 

ECMWF   Numerical  Weather  Predic0on   Challenges  Ahead   Leaning  from  the  Past   Preparing  the  Future  

13/05/15

2

ECMWF  

13/05/15

3

ECMWF   An  independent   intergovernmental   organisa0on     established  in  1975     with   21  Member  States   13  Co-­‐opera0ng  States  

13/05/15

4

Who  are  we  and  what  do  we  do?     Weather  Forecasts Medium-­‐Range

 We  produce   global  weather  forecasts    Up  to  15  days  ahead.   Our  products  also  include   monthly  and  seasonal  forecasts   and  we  collect  and  store   meteorological  data.  

  What  do  we  have  to  achieve  this?   People  About  260  staff,   specialists  and  contractors   Equipment  State-­‐of-­‐the-­‐art  supercomputers   and  data  handling  systems   Budget  £50  million  per  year  

Reading,  United  Kingdom  

   

6

Numerical  Weather  Predic0on  

13/05/15

7

Model  

8

A  basic  descrip0on  of  our  models      

9

Data  sources  for  the  ECMWF  Meteorological  Opera0onal  System  

10

Major  assimilated  datasets   Receive  300  million  observa0ons     from  130  sources  daily.     Surface   sta:ons  

Polar,   infrared  

Radiosonde   balloons  

AircraC  

Geosta:onary,  IR  

Polar,   microwave  

4D-­‐Var  Assimila0on  

12

Data  Assimila0on  

short-­‐range  forecast  

measurements  

corrected  representa0on  of  the  atmosphere     13

Model  Levels   Current  opera0onal:   137  levels     Alt.    ≈40.000m    

14

Opera0onal  model  grid   High  Resolu:on  (16km)  

ENS  (32Km)  

Grid  Effect   OROGRAPHY, GRID POINTS AND LAND SEA MASK IN TL 639 (EPS 2010) ECMWF MODEL orography shaded (height in m), land grid points (red), sea grid points (blue) 5°E 10°E 15°E 2359.1 2300 200

2200 2100 2000 1900 1800 1700 1600 1500

400

1400 1300 1200 1100 OROGRAPHY, GRID POINTS AND LAND SEA MASK IN TL 1279 (OP 2010) ECMWF MODEL 200 1000 orography shaded (height in m), land grid points (red), sea grid points (blue) 45°N 5°E 15°E 900 10°E

45°N

200

800

0 40

3000

700 200

600

2800

500 2600

20 0

400 300

2400

200

400

2200

100 10

5°E

10°E

2000

15°E

1800 1600 1400 45°N

45°N

1200 1000 800 600 400 200 10

5°E

10°E

15°E

16

Meteorological  Fields   Fields:   2M  grid  points  x  137  levels   287M  values,  per  variable      Opera0onal  models  produce:     –   13  millions  fields  daily     –  Totalling  14  TB/day       Time  cri:cal  window:  1h  !!!  

17

Chaos  and  weather  predic0on   The  atmosphere  is  a  chao:c  system   •  Small  errors  can  grow  to  have   major  impact  (buberfly  effect)   •  Can  never  perfectly  measure  the   state  of  the  atmosphere   •  Limits  weather  predic0on  to  a   week  or  so  ahead   Ensemble  Forecasts   •  Parallel  set  of  forecasts  from  very  

slightly  different  ini0al  condi0ons   and  model  formula0on   •  Assess  uncertainty  of  today’s   forecast  

18

Why  run  an  ensemble  predic0on  system?   Basic  idea   •  Taking  account  of  uncertainty   •  Forecas0ng  forecast  skill   Forecas:ng  benefits   •  Assess  uncertainty  of  today’s  forecast   •  Provide  alterna0ve  forecast  scenarios   •  Highlight  the  predictable  (large-­‐scale)   component  and  the  risk  for  a  less  likely   but  significant  (small-­‐scale)  event   Con:nuing  challenges   •  Forecas0ng  extreme  events   •  Extending  the  forecast  range  

AN 19871016, 06GMT

MSLP  66-­‐h  forecasts  for  16-­‐Oct-­‐1987   Mul0-­‐valued  forecast,  the  EPS  

EPS Cont FC +66 h

956 979 - mem no. 1 of 51 +66 h

- mem no. 2 of 51 +66 h

- mem no. 3 of 51 +66 h

- mem no. 4 of 51 +66 h

978 984 - mem no. 11 of 51 +66 h

963

- mem no. 7 of 51 +66 h

- mem no. 8 of 51 +66 h

- mem no. 9 of 51 +66 h

- mem no. 14 of 51 +66 h

962 - mem no. 15 of 51 +66 h

- mem no. 16 of 51 +66 h

988 - mem no. 17 of 51 +66 h

- mem no. 18 of 51 +66 h

- mem no. 21 of 51 +66 h

- mem no. 19 of 51 +66 h

- mem no. 23 of 51 +66 h

- mem no. 24 of 51 +66 h

- mem no. 31 of 51 +66 h

990

966 - mem no. 32 of 51 +66 h

970

- mem no. 33 of 51 +66 h

- mem no. 25 of 51 +66 h

976

965 - mem no. 26 of 51 +66 h

- mem no. 34 of 51 +66 h

- mem no. 28 of 51 +66 h

983

- mem no. 35 of 51 +66 h

- mem no. 36 of 51 +66 h

- mem no. 42 of 51 +66 h

980

- mem no. 43 of 51 +66 h

986 977

979

- mem no. 44 of 51 +66 h

981

972

974

- mem no. 29 of 51 +66 h

- mem no. 30 of 51 +66 h

964

975

980

962

967

- mem no. 37 of 51 +66 h

- mem no. 38 of 51 +66 h

972

964

988 985

970 979

980

- mem no. 41 of 51 +66 h

- mem no. 27 of 51 +66 h

982

979 961

- mem no. 20 of 51 +66 h

984

965 - mem no. 22 of 51 +66 h

984

969

966

979 964

- mem no. 10 of 51 +66 h

981

981

- mem no. 13 of 51 +66 h

- mem no. 6 of 51 +66 h

979

968

- mem no. 12 of 51 +66 h

- mem no. 5 of 51 +66 h

- mem no. 45 of 51 +66 h

- mem no. 46 of 51 +66 h

964 - mem no. 47 of 51 +66 h

- mem no. 39 of 51 +66 h

- mem no. 40 of 51 +66 h

978 978

988 - mem no. 48 of 51 +66 h

- mem no. 49 of 51 +66 h

960

- mem no. 50 of 51 +66 h

980 976

958 968

987

963

989

20

Product  Genera0on   77  million  products   disseminated  ever  day,   totalling  6  TB.   Interpolate  output  fields  into   user  required  grids   Product  genera0on  is  also   subject  to  a  dissemina:on   schedule...   Time  cri:cal  window:  1h  !!!    

21

RMDCN  Connec0ons  

22

Products  also  served  via  web   visualisa:on  services   (demo)  

13/05/15

23

Challenges  Ahead   Scalable  Algorithms   Plaiorm  Uncertainty   Big  Data   13/05/15

24

New  Algorithms  Challenge   To  perform  beber  forecasts:   •  Assimilate  beber  &  more  observa0ons   •  Keep  increasing  model  resolu0on   •  Add  modeling  of  new  physical  phenomena   •  Add  more  ensemble  members   More  computa:ons   All  run  within  a  cri:cal  1h  :me  window   25

Evolu0on  of  ECMWF  scores  comparison   northern  and  southern  hemispheres  

~39km   ~210km  

~125km  

~25km  

~16km  

~63km  

26

Benefits  of  High  Resolu0on   Mean  sea-­‐level  pressure   AN  30  Oct    5d  FC  T3999                                    5d  FC  T1279

                         5d  FC  T639  

Sandy  28  Oct  2012  

Precipita:on:  NEXRAD  27  Oct                  

               4d  FC  T639

3d  FC:  Wave  height

                                       4d  FC  T1279                          

 

 Mean  sea-­‐level  pressure  

               4d  FC  T3999  

27

Planned Resolution Upgrades

9Km  by  end  2015  

28

Archive  size  vs.  Supercomputer  power   100000

100000

HPC (GFLOPs) Archive (TB)

10000

10000

1000

1000

100

100

10

10

1

1

IBM-P6 (Jul 2009)

IBM-P5+ (Jan 2007)

IBM-P5 (Jul 2004)

IBM-P4 (Dec 2002)

VPP5000 (Apr 1999)

VPP700-112 (Oct 1997)

VPP700/48 (Jun 1996)

C90/16 (Jan 1993)

C90/12 (Jan 1992)

X-MP/8 (Jan 1990)

0.01

X-MP/4 (Jan 1986)

0.01

X-MP/2 (Nov 1983)

0.1

Cray-1A (Nov 1978)

0.1

Archival  rate  has  always  been  strongly  correlated  with  computa0onal  power  

Resolu0on  Upgrades   Resolu:on  

Grid  size  

Grid  Points  

Field  Size     (in  memory)  

T319  

62.5  km  

204  k  

1.6  MB  

T511  

39  km  

524  k  

4  MB  

T799  

25  km  

1.2  M  

9.6  MB  

T1279  

16  km  

2.1  M  

16.8  MB  

T2047  

10  km  

8.4  M  

67.2  MB  

T3999  

5  km  

20  M  

160  MB  

T7999  

2.5  km  

80  M  

640  MB  

As  memory  per  core  diminishes  (think  GPU’s  and  Intel  Phi)  …   …  this  may  have  serious  implica0ons  on  the  data  processing  sonware!   30

Power  Challenge   → 25%  of  HPC  used  for  opera0ons   → 10%  of  opera0ons  for  HRES  forecast     → 2.5%  of  HPC  

Time  cri:cal  limit  240  days/day  

2  MW

 

                                     6  MW  

→  2.5  km*  forecast  on  Cray   XC-­‐30  (84,000  cores)   =  2  MW     →  240  days/day  require   270,000  cores   =  6  MW     →  x  10**  =  60  MW  ?       *  planned  for  2025     **  scaling  from  HRES  to  full  HPC   31  

Plaiorm  Uncertainty  Challenge   •  How  to  keep  up  with  the  changing   hardware  landscape?   •  How  to  best  use  the  new  heterogeneous   systems?   –  CPU’s –  GPU’s –  Accelerators

•  With  very  limited  human  resources…   32

Hardware  Landscape   Prolifera0on  of  HPC  Architectures  

Intel Xeon

Intel Xeon Phi

IBM Power8

NEC SX-ACE Aurora

NVidia Tesla

ThunderX ARM

AMD FirePro

X-Gene ARM

13/05/15

33

Power  Mabers   “Heterogeneous  High  Throughput  Scien2fic  Compu2ng  with  APM  X-­‐Gene  and  Intel  Xeon  Phi”,  Adburachmanov  et.  al.,  2014    

13/05/15

34

Hypothe0cal  Solu0on   •  •  •  • 

Need  flexibility?   Need  performance?   Code  easier  to  maintain?   A  modern  language?   We  have  a  solu:on…  

C++   13/05/15

35

But  …  

13/05/15

36

HPC  Plaiorm  Restric0ons   •  POSIX     •  Only  supported  languages:   –  Fortran –  C99 –  Recently C++ (but not C++11) –  Shell (Ksh) + Perl (but not Python)

•  Most  of  our  scien0sts  know   –  Fortran –  Python (maybe) –  Basic shell scripting 13/05/15

37

C++  Support  on  HPC   •  Support  for  C++  in  HPC  plaiorms  is  …  patchy!   •  Some  HPC  compilers  aren’t  fully  C++98  and   choke  on  heavily  template  code:   –  Eigen  library  for  linear  algebra   –  Parts  of  Boost   •  serializa0on   •  Proto  

•  Not  to  men0on  C++11!   Change  on  the  way?   Community  &  Vendors  start  talking  about  C++   13/05/15

38

Scien0sts  are  reluctant  to  use  C++   Possible  explana0ons  (my  view):   •  Fortran  is  what  they  know   •  Fortran  is  very  fast  for  numeric  computa0ons   •  C++  was  (ini0ally)  perceived  as  slow:   –  hidden constructors –  misuse of classes –  restrict keyword Mostly corrected now (e.g. work of T. Veldhuizen)

•  But  C++  is  s0ll  perceived  as  …   –  complex (templates are a common complaint) –  less portable (than C or Fortran) 13/05/15

39

Big  Data  Challenge   “Big  Data  is  high  volume,  high  velocity,  and/or  high   variety  informa0on  assets  that  require  new  forms   of  processing  to  enable  enhanced  decision  making,   insight  discovery  and  process  op0miza0on.”       “3D  Data  Management:  Controlling  Data  Volume,  Velocity  and  Variety”,  D.  Laney,  Gartner,  2001  

 

The  3  V’s  of  Big  Data   40

V  is  for  Volume:  Observa0ons   Increase  of  satellite  data  usage  

41

V  is  for  Volume:  Products  

42

V  is  for  Volume:  Archive  

Dele0on  of  1  PB  

43

V  is  for  Velocity   • 

 ECMWF’s  archive  grows  exponen0ally:   Initial volume

Volume of the archive

Time

Rate of growth

–  r is around 0.5, which is a 50% increase per year –  The rate of added data also grows exponentially at the same rate!

  •  • 

In  1995,  the  size  of  the  archive  was  increasing  at  a  rate  of    14  TB/year.   In  2015,  the  size  of  the  archive  increases  at  a  rate  of  100  TB/day    

44

V  is  for  Variety   X-MP/4

Y-MP/8

C90/12

VPP700-48

VPP5000

IBM-P4

IBM-P5

IBM-P5+

IBM-P6

10T C90/16

VPP700-112

T1279L91

T319L50

T106L19 T213L31

T511L60

T319L31

Weekly   Monthly

T799L91

T106L16

1T

T319L60 Ensemble data assimilation

00Z EPS

00Z 10  day  FC

Overlap, CalVal

Vareps/Monthy 4d-Var Model errors

Wave EFIs

SCDA 4D-Var

100G

DCDA

4D-var increments Extra fields,  new   gaussian  grid

Weekend  EPS

Wave proba.

SCDA Forecast

Wave EPS

4D-Var

00Z  Run

1G

Ensemble means & stdev

Clusters

Chernobyl

Probabilities Waves

SSTs

FC Model levels

1/3  growth  is  resolu:on  increase   2/3  growth  is  increase  of  product  types  

Waves FG

EPS

100M

Tubes

Errors in AN and FG NCEP EPS

Other centers

Errors in FG

SCDA Analysis

Errors if FG, surface

OI

Sensitivity

00Z  Run

Multi-Analysis

Wave 4V PT and PV levels

End  sensitivity

10G

DCDA Wave EFIs

SCDA Waves SCDA Forecast

EPS PT levels

TOGA FC

3DVar

FC Pressure levels

4DVar

12 Hour 4DVar

DCDA

Vareps/Monthy

EDA

EPS 15 days

50 Members EPS

10M 85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

00

01

02

03

04

05

06

07

08

09

45

10

Nothing  of  this  is  new   We  have  always  dealt  with  Big  Data…   What  changed?  

46

CPU  Power  Growth   “The  Free  Lunch  is  Over”.  H.  SuRer,  Dr.  Dobb's  Journal,  30(3),  March  2005  

But  what  about  “real”  performance?   47

CPU  Performance  Growth  (single-­‐threaded)   “A  Look  Back  at  Single-­‐Threaded  CPU  Performance”,  J.  Pershing  Feb  2012  

More  registers,  vector  units,  branch  predic0on  …       but  also  harder  to  achieve!     Deeper  memory  hierarchy,  out-­‐of-­‐order  execu:on  …     48

Storage  Density  Growth   Mul0ple  Technologies   “Tape  based  magne2c  recording:  technology  landscape  comparisons  with  hard  disk  drive  and  flash  roadmaps”,     R.  Fontana  et  al,  IBM  Research  Division,  2011  

49

HDD  Storage  Growth   “GPFS  Scans  10  Billion  Files  in  43  minutes”.  R.  Freitas,  et  al.  IBM  Research  Division,  2011  

Volume  is  linearly  propor0onal  to  area  density   …  but  transac0on  rate  hasn’t  kept  up!   Recently  follows  25-­‐40%  CAGR…   This  means  that  we  may  have  the  capacity,  but  maybe  not  the  bandwidth  …   50

What  does  it  imply?   •  “No  Free  Lunch”  è  Improve  our  sonware   –  NWP models have been parallel and concurrent for many decades

  •  Op0mize  Data  Movement   •  Develop  new    Algorithms  that  explore…   –  (More)  Concurrent  computa0ons     –  Data  locality     –  Computa0onal  intensity  (CPU  usage/MB  transferred)   •  Sonware  must  cope  with  changes  –  Flexibility   –  Best  use  of  new  hardware   –  Unknown  future  for  parallel  plaiorms  

51

Learning  from  the  Past   MARS   B.  Raoult,  M.  Fuentes  

13/05/15

52

Meteorological  Archival  and  Retrieval  System   A  managed  archive,  not  a  file  system   •  Users  not  aware  of  data  loca0on   •  Express  access  in  meteorological  terms     Data  is  kept  forever:   •  As  data  is  accumulated,  datasets   become  more  useful   •  Dele0ng  old  data  in  an  exponen0ally   growing  archive  is  meaningless   Consists  of  3  layers:   •  Cache  in  the  HPC  (~80%  hit  ra0o)   •  Servers  HDD  cache  (~80%  hit  ra0o)   •  HPSS  Tape  system   53

Meteorological  Archival  and  Retrieval  System   Fully  distributed   –  15  servers  for  metadata  and  data  movers   –  80  PB  primary  archive   –  1.5  PB  of  disk  cache  (2.5%)   –  110  billion  fields  in  8.5  million  files  (2014)   –  200  million  objects  (2014)   –  100  TB  added  daily   –  100  TB  retrieved  per  day   –  2  million  requests  per  day  

•  •  •  • 

Fully  wriben  in  C++   Since  1995   Index    is  Object  Oriented  DB   Persisted  in  filesystem   54

MARS  2011  Migra0on   •  •  •  •  • 

From  monolythic  to  distributed   From  AIX  to  Linux   From  Big  Endian  to  Lible  Endian   Metadata  indexing  40PB   Job  2  years  in  the  planning  

•  Degraded  service  for  a  couple  of  hours   •  Noone  no0ced...   13/05/15

55

A  meteorological  language   •  retrieve,   date   parameter type   step   levtype levels grid   area  

 =  =  =  =  =  =  =  =

 20110101/to/20110131,    temperature/geopoten0al,    forecast,    12/to/240/by/12,    pressure  levels,    1000/850/500/200,    2/2,    -­‐10/20/10/0  

  •  This  request:   –  represents 31*2*20*4 = 4960 fields –  defines an interpolation followed by cropping 56

Indirec:on  is  the  key  to  scalability  

13/05/15

57

Preparing  the  Future   Atlas   Hermes   13/05/15

58

Aims   •  Big  data  challenge  (scalability)   •  Adaptable  solu0on  (flexibility)   •  Code  evolu0on  (maintainability)  

Lets  use  C++  (98)   13/05/15

59

Atlas   W.  Deconinck,  T.  Quin0no  

Challenges:  New  Algorithms  &  Plalorm  Uncertainty  

13/05/15

60

Current  IFS  Model   •  •  •  •  •  •  •  • 

2M  lines  of  Fortran  code   9000  rou0nes  /  modules   Hybrid  MPI  +  OpenMP   Fixed  data-­‐structure   Well  organized  but…   …  lots  of  global  data.   Imprac0cal  to  rewrite   Fortunately  most  global  opera0ons  are   located  in  the  “dynamical  core”.     13/05/15

61

Atlas   •  Framework  for  parallel,  dynamic  data   structures   •  Suppor0ng  mul0ple  types  of  grids   •  Fully  wriben  in  C++   •  Providing  Fortran  2003  interfaces   •  Basis  to  develop  scalable  dynamical  core  

13/05/15

62

Atlas  capabili0es  

13/05/15

63

So  what  about  Fortran?   •  Keep  Fortran  for  numeric  algorithms   •  C++  for  …   –  All data & memory management –  Message passing

•  For  every  C++  class  -­‐>  Fortran  Derived  Type   •  Use  Fortran  ISO  C  Bindings  (F2003)  

13/05/15

64

C++  contains  implementa0on   // C++ implementation class Field { public: Field(int size) { data_.resize(size); } double* data() { return &data_[0]; } private: vector data_; } // C interface – to be called from Fortran Field* atlas__new_Field (int size) { return new Field( size ); } void atlas__Field__access_data(Field* this, double& data, int& s){ data = this->data(); size = this->size(); } 13/05/15

65

WARNING:  Fortran  code  ahead!  

13/05/15

66

Fortran/C  interface   ! Fortran Derived type , contains no data , only methods F03 TYPE :: Field_type PRIVATE TYPE(C_PTR) :: object = C_NULL_PTR CONTAINS PROCEDURE, PUBLIC :: access_data => Field__access_data END TYPE ! Method calls C function, which uses C++ implementation SUBROUTINE Field__access_data(this,data) CLASS(Field_type), intent(in) :: this TYPE(C_PTR) :: cptr_data REAL(JPRB), POINTER :: data INTEGER :: size ! This is the C function CALL atlas__Field__access_data(this%object ,cptr_data ,size) ! Converts the C pointer to a fortran pointer to array of doubles CALL C_F_POINTER(cptr_data ,data ,(/size/)) END SUBROUTINE

13/05/15

67

Fortran  Program   PROGRAM main TYPE(Field_type) :: field REAL(JPRB), POINTER :: field_data(:) INTEGER :: j ! We can ask the field for access to its data CALL field%access_data(field_data) ! Now we can use the data as any other Fortran array DO j=1,size(field_data) field_data(j) = 0. ENDDO END PROGRAM

•  Furthermore,  we  provide  another  higher   level  API  to  hide  Fortran  2003.   13/05/15

68

Hermes   T.  Quin0no,  B.  Raoult,  F.  Rathgeber  

Challenges:  Big  Data  &  Plalorm  Uncertainty  

13/05/15

69

A  look  at  the  data-­‐chain   Observa:ons  

Obs1  

Obs2  

Obs1   IFS  Model   Obs3  

Obs2  

Obs4  

13/05/15

70

A  look  at  the  data-­‐chain   IFS  Model   Raw  Model   Output  

PG1  

PG2   PG3   Dissemina:on  

PG4   PG1  

PG3   Products  

13/05/15

71

Observa0ons  &  Fields:   Some  Similari0es   •  Both  processed  in  pipelines   •  Both  are  parallel…   –  Concurrent (Threads / OpenMP) –  Distributed (MPI) How does it differ from an atmospheric Model? •  Parallel, but little or no synchronisation •  No time evolution •  “Embarrassingly parallel” (nearly) Possible: Automatic Parallelisation 13/05/15

72

Requirements   •  •  •  •  • 

Way  to  define  computa0ons   Efficient  parallel  computa0on   Way  to  accept  data  and  deliver  results   Flexibility  for  future  architectures   Extendable  and  understandable  by  scien0sts  

13/05/15

73

Solu0on   Indirec:on  is  key  to  Scalability  

Decouple  what  to  compute     from  how  and  where  to  compute  

13/05/15

74

Use  Case:  Product  Genera0on   tasks  

supercomputer  

t1   User   Requests  

f1   generate  

t2  

dispatch  

t3   f2  

t1   t2  

map  (  interpolate,              list  (  f1,  f2,  ...  )  )  

Frontend  (define)  

Linux  cluster  

worksta0on  

Backend  (compute)   13/05/15

75

Use  Case:  Observa0on  Filters   supercomputer  

observa0ons   map  (  f,  list  (  …  )  )  

f   map  (  g•f,  list  (  …  )  )  

f  

dispatch  

g  

map  (  g,  list  (  …  )  )  

g  

Linux  cluster  

map  (  h•g•f,  list  (  …  )  )  

f  

g  

h   worksta0on  

Frontend  (define)  

Backend  (compute)   13/05/15

76

Use  Case:  Worksta0on  Interpola0on  &   Visualisa0on   dispatch  

Fields   T8000   (N  x  670MB)   worksta0on   map  (  interpolate,              list  (  …  )  )  

Frontend  (define)  

Fields   N640   (N  x  10MB)  

Interpola0on  cluster  

Backend  (compute)   13/05/15

77

Hermes   •  •  •  • 

DSL  to  define  computa0ons   Frontend  for  sending  requests  and  data   Broker  to  dispatch  work  to  PU’s   Mul0ple  computa0onal  back-­‐ends:   –  CPU –  CUDA (Linear Algebra) –  Intel MKL (Linear Algebra) –  Intel TBB ?? –  HPX ?? 13/05/15

78

Dispatcher   Producers   Thread  1  

pick   push  

Task  1  

Consumers   Picker  

Task  2   Thread  2  

Task  3  

handle   Handler  

push  

Thread  1   Thread  2  

Task  4   Thread  3   …   Thread  M  

Task  5   Task  6   …   Task  N  

Thread  3   …   Thread  M  

MessageQueue  

13/05/15

79

Expression  Engine   •  •  •  •  • 

Prototype  DSEL   Embedded  in  C++   Plug-­‐ins  in  Fortran  (not  yet)   Computa0ons  as  expression  trees   Hierarchy  of  expressions:   –  terminals (value semantics) –  functions (pure operations)

13/05/15

80

Expression  Engine   •  •  •  •  •  • 

Inspired  by  func0onal  language  Haskell   Expressions  immutable   Func:ons  pure   No  variables,  only  constants   Lazy  evalua0on   Op0miser:  find  most  efficient  path  to   evaluate  expression   –  Pattern matching (e.g. linear combination) –  Collapse expression tree into terminal –  Detect common sub-expressions –  Automatic parallelisation 13/05/15

81

Expression   typedef  std::shared_ptr  ExpPtr;   typedef  std::vector<  ExpPtr  >  args_t;     class  Expression  {   public:      ...      Expression(const  args_t&  args);        size_t  arity()  const  {  return  args_.size();  }      ...      ExpPtr  self();      ExpPtr  eval()  const  {  Scope  s;  return  ExpPtr(optimise()-­‐>evaluate(s))  };      ExpPtr  eval(Scope&  s)  const  {    return  ExpPtr(optimise()-­‐>evaluate(s))  };        ...      //  used  to  bind  lambda  parameters        virtual  ExpPtr  resolve(Scope&)  const  {  return  self();  }      virtual  ExpPtr  optimise(size_t  depth)  const  {  return  self();  }      template  shared_ptr  as()  {  ...  }   private:      virtual  ExpPtr  evaluate(Scope&)  const  =  0;      virtual  ExpPtr  cloneWith(args_t&)  const  =  0;      virtual  std::string  signature()  const  =  0;   };     13/05/15

82

Values   class  Value  :  public  Expression  {   protected:      Value(  const  args_t&  args  )  :  Expression(args)  {}   private:      virtual  ExpPtr  evaluate(Scope&)  const  {  return  self();  }   };  

class  Real  :  public  Value  {   public:      typedef  real_t  value_t;      ...      Real(ExpPtr  e);              Real(const  real_t&  v);      ...      value_t  value()  const  {  return  v_;  }      static  value_t  extract  (const  ExpPtr&  e)  {  return  e-­‐>as()-­‐>value();}      ...   private:      virtual  std::string  signature()  const  {  return  "r";  }   };  

Bool,  Integer,  Real,  String,  List,  Vector,  Matrix   13/05/15

83

Func0ons   class  Function:  public  Expression  {   public:        typedef  ExpPtr  (*func_t)  (Scope&,const  args_t&);      typedef  std::map<  key_t,  func_t  >  dispatcher_t;        virtual  ExpPtr  evaluate(Scope&)  const;      virtual  ExpPtr  optimise(size_t  depth)  const;        virtual  std::string  signature()  const  {  return  signatureArguments(args());  }     protected:      Function(  const  args_t&  args  )  :  Expression(args)  {}      //  returns  something  like  "func(r,v,r,v)"      std::string  signatureArguments(const  args_t&  args)  const  {...}     private:      virtual  ExpPtr  evaluate(Scope&)  const  {  return  self();  }   };  

13/05/15

84

Func0ons  (2)   ExpPtr  Function::evaluate(Scope&  ctx)  const  {        //  create  temporary  args      args_t  tmp  =  args();        for(size_t  i  =  0;  i  <  tmp.size();  ++i)  {          tmp[i]  =  args(i,  ctx,  true)-­‐>self();  //  evaluate  the  arguments      }                std::string  sig  =  signatureArguments(tmp);  //  new  signature        dispatcher_t&  d  =  dispatcher();    //  find  dispatching  function      dispatcher_t::iterator  itr  =  d.find(sig);      if(itr  ==  d.end())    {  ...  fail  ...  }              return  ((*itr).second)(  ctx,  tmp  );  //  dispatch   }     ExpPtr  Function::optimise(size_t  depth)  const  {      //  revert  to  generic  optimizer  if  Expression  doesn’t  optimize  itself      return  Optimiser::apply(self(),  depth);   }  

13/05/15

85

Func0ons  (3)   class  Map  :  public  Function  {   public:      Map(ExpPtr  func,  ExpPtr  lst  );     private:      ...      ExpPtr  Map::evaluate(Scope&  ctx)  const  {            ExpPtr  f    =  args(0,  ctx,  false);  //  get  the  function,  don’t  evaluate          ExpPtr  ls  =  args(1,  ctx,  true);    //  get  the  list  and  evaluate  if  necessary            const  List::value_t&  list  =  List::extract(ctx,  ls);            List::value_t  res;            res.reserve(list.size());          for(  size_t  i  =  0;  i  <  list.size();  ++i  )  {                ExpPtr  e  =  list[i]-­‐>resolve(ctx)-­‐>eval(ctx);  //  evaluate  each  element                ExpPtr  v  =  f-­‐>eval(e);                res.push_back(  v  );            }          return  ExpPtr(new  List(res));      }      ...   };     13/05/15

86

Func0ons  (4)   Map   Reduce   Merge   Take   ZipWith   IfElse   Call   Lambda  

Filter   Bind   UnaryOperators   •  neg,sqrt,abs   BinaryOperators   •  mul,add,sub,div   •  mod,min,max   Predicates  

13/05/15

87

Examples      auto  a  =  real(2.);      auto  b  =  real(4.);      auto  x  =  vector({1.,2.,3.});      auto  y  =  vector(3,  5.);        auto  c  =  add(a,b);        std::cout  <<  *c  <<  std::endl;  //  Add(Real(2),  Real(4))        auto  e  =  add(mul(c,  x)  ,  mul(b,  y));        c  =  c-­‐>eval();        std::cout  <<  *c  <<  std::endl;  //  Real(6)            //  Add(Mul(Add(Real(2),  Real(4)),  Vector(1,  2,  3)),  Mul(Real(4),  Vector(5,  5,  5)))      std::cout  <<  *e  <<  std::endl;          auto  opt  =  e-­‐>optimise(true);        std::cout  <<  opt-­‐>signature()  <<  std::endl;  //  optimised  to  "Linear(r,v,r,v)"        auto  f1  =    reduce(mul(),  list(a,  x,  x))-­‐>eval();      //  Vector(2,  8,  18)   13/05/15

88

Examples   class  Xpr  provides  value  sema0cs  and  convinience  operators      //  lambdas        Xpr  twice  =  call(lambda("x",  Xpr(2.0)  *  Xpr("x")));  //  define  2*x        Xpr  l  =  list(a,b,c);        Xpr  r  =  map(twice,  l);        std::cout  <<  r()  <<  std::endl;  //  List(Real(4),  Real(8),  Real(12))        //  currying        Xpr  F  =  lambda("i",  call(  lambda("j",  Xpr("i")  *  Xpr("j"))));  //  define  i*j            Xpr  H  =  call(F,  Xpr(  2.  ));        //  curry  to  i*2            Xpr  G  =  H(  3.  );                              //  curry  to  3*2        std::cout  <<  G()  <<  std::endl;  //  Real(6)  

13/05/15

89

Examples      //  Define  the  Y-­‐combinator          Xpr  Y  =  lambda("f",  call(Xpr("f"),  Xpr("f")));        //  lookup  table  for  first  10  elements  of  series        Xpr  table  =  list(  Xpr(0.),  Xpr(  1.),  Xpr(  1.),  Xpr(  2.),  Xpr(  3.),  Xpr(5.),                                          Xpr(8.),  Xpr(13.),  Xpr(21.),  Xpr(34.),  Xpr(55.)  );        //  non-­‐recursive  fibonacci  function        Xpr  fib  =  lambda("f",                                        call(lambda("n",                                                                ifelse(                                                                        Xpr("n")  >  Xpr(10.),                                                                        Xpr(call(Xpr("f"),Xpr("f"),  Xpr("n”)-­‐Xpr(1.0)))  +                                                                        Xpr(call(Xpr("f"),Xpr("f"),  Xpr("n”)-­‐Xpr(2.0))),                                                                        take(  Xpr("n"),  table  )                                                                        )                                                                  )                                                    )                                          );        Xpr  fibonnaci  =  call(Y,  fib);  //  applying  the  Y-­‐combinator        std::cout  <<  fibonnaci(20.)  <<  std::endl;  //  Real(6765)   13/05/15

90

Arithme0c  Expression   5  *  7  +  15  ==  add  (  prod  (  5,  7  ),  15  )  

sum   sum   prod  

15   35  

5  

15  

50  

7  

constant  folding  and  propaga0on   13/05/15

91

Filter  Expression   filter  (  greater  (  ?,  2  ),  list  (  -­‐2,  4,  1,  5  )  )  

filter   list   greater   ?  

list   2  

-­‐2  

4  

1  

5  

4  

5  

expression  elimina0on   13/05/15

92

Future  work:  Map  fusion   map  (  h,  map  (  g,  map  (  f,  list  (  …  )  )  )  )  

map   f  

map  

map   g  

f  Ÿ  g  Ÿ  h  

map   h  

list  

list   13/05/15

93

Future  work   Distributed  interpola2on  of  fields   write  (  average  (  map  (  interpolate,  list  (  field1,                                                                                                    field2,                                                                                                    …,                                                                                                    field6  )  )  )  )   write   average  

distribute  

map   interpolate  

field1  

field2  

field3  

list   field4  

field5  

field6   13/05/15

94

Future  work   Distributed  interpola2on  of  fields   write  (  average  (  map  (  interpolate,  list  (  field1,                                                                                                    field2,                                                                                                    …,   write                                                                                                    field6  )  )  )  )  

automa0c   parallelisa0on  

average   list  

list  

thread  1  

interpolate  

interpolate  

interpolate  

interpolate  

field3  

field4  

field5  

field6  

thread  1  

interpolate  

interpolate  

field1  

field2  

Host  a  

list  

thread  2  

thread  2  

thread  1  

list  

Host  b  

thread  2  

Host  c   13/05/15

95

Linear  Algebra  Expressions   norm  (  sub  (  vector,  vector  )  )  

norm   norm  

sub   vector  

vector  

real  

vector   norm  of  difference  of  two  vectors   13/05/15

96

Dispatch  to  Compute  Engines   GPU  engine  

dot   vector  

dispatch  

vector   dot  

matrix  

vector  

Expression   signatures  

MKL  engine  

Dot(v,v)   Dot(m,v)   Dot(m,m)  

cblas_ddot   cblas_dgemv   cblas_dgemm   Generic  engine  

dot   matrix  

cublasDdot   cublasDgemv   cublasDgemm  

matrix  

ddot   dgemv   dgemm  

BLAS/   LAPACK   13/05/15

97

Expressions:  client  code   const  size_t  ncols  =  1024;   const  size_t  nrows  =  1024;     auto  init  =  [](size_t,  size_t  j=0)  {      return  (real_t)std::rand()  /  RAND_MAX;   };     XprClient  c;     //  L2  norm  of  difference  of  2  vectors   c.send(  norm(  sub(  vector(ncols,  init),  vector(ncols,  init)  )  )  );     //  Dot  product  of  vectors   c.send(  dot(  vector(ncols,  init),  vector(ncols,  init)  )  );     //  Dot  product  of  matrix  and  vector   c.send(  dot(  matrix(nrows,  ncols,  init),  vector(ncols,  init)  )  );     //  Dot  product  of  matrix  and  matrix   c.send(  dot(  matrix(nrows,  ncols,  init),  matrix(ncols,  nrows,  init)  )  );   13/05/15

98

Generic  Engine  implementa0on   ExpPtr  compute_dot_generic(Scope&,  const  args_t  &p)   {          Vector::value_t  v1  =  Vector::extract(p[0]);          Vector::value_t  v2  =  Vector::extract(p[1]);          ASSERT(  v1.size()  ==  v2.size()  );          Vector::elemt_t  r;          for(  size_t  i  =  0;  i  <  v1.size();  ++i  )                  r  +=  v1[i]  *  v2[i];          return  real(r);   }     ExpPtr  compute_gemv_generic(Scope&,  const  args_t  &p);   ExpPtr  compute_gemm_generic(Scope&,  const  args_t  &p);     //  Register  generic  functions  in  dispatch  table   struct  DotRegister  {          DotRegister()  {                  Function::dispatcher()["xpr::Dot(v,v)"]  =  &compute_dot_generic;                  Function::dispatcher()["xpr::Dot(m,v)"]  =  &compute_gemv_generic;                  Function::dispatcher()["xpr::Dot(m,m)"]  =  &compute_gemm_generic;          }   };   Static  DotRegister  dotRegister;  

13/05/15

99

MKL  Engine  implementa0on   ExpPtr  compute_dot_MKL(Scope&,  const  args_t  &p)   {      const  Vector::value_t&  v1  =  Vector::extract(p[0]);      const  Vector::value_t&  v2  =  Vector::extract(p[1]);      ASSERT(  v1.size()  ==  v2.size()  );      Vector::elemt_t  r;      return  real(  cblas_ddot(  v1.size(),  v1.data(),  1,  v2.data(),  1  )  );   }     ExpPtr  compute_gemv_MKL(Scope&,  const  args_t  &p)   ExpPtr  compute_gemm_MKL(Scope&,  const  args_t  &p)     //  Register  MKL  functions  in  dispatch  table   void  registerMKL()  {      Function::dispatcher()["xpr::Dot(v,v)"]  =  &compute_dot_MKL;      Function::dispatcher()["xpr::Dot(m,v)"]  =  &compute_gemv_MKL;      Function::dispatcher()["xpr::Dot(m,m)"]  =  &compute_gemm_MKL;   }  

13/05/15

10 0

GPU  Engine  implementa0on   ExpPtr  compute_dot_CUDA(Scope&,  const  args_t  &p)   {    const  Vector::value_t&  x  =  Vector::extract(p[0]);    const  Vector::value_t&  y  =  Vector::extract(p[1]);    ASSERT(  x.size()  ==  y.size()  );    const  size_t  size  =  x.size()*sizeof(real_t);    Vector::elemt_t  r;      real_t*  d_x;  ///<  device  memory  vector  x    real_t*  d_y;  ///<  device  memory  vector  y    cublasHandle_t  handle;      CALL_CUDA(  cudaMalloc((void**)  &d_x,  size)  );    CALL_CUDA(  cudaMalloc((void**)  &d_y,  size)  );    CALL_CUDA(  cudaMemcpy(d_x,  x.data(),  size,  cudaMemcpyHostToDevice)  );    CALL_CUDA(  cudaMemcpy(d_y,  y.data(),  size,  cudaMemcpyHostToDevice)  );      CALL_CUBLAS(  cublasCreate(&handle)  );    CALL_CUBLAS(  cublasDdot(handle,  x.size(),  d_x,  1,  d_y,  1,  &r)  );    CALL_CUBLAS(  cublasDestroy(handle)  );      CALL_CUDA(  cudaFree(d_x)  );    CALL_CUDA(  cudaFree(d_y)  );      return  real(  r  );   }   13/05/15

10 1

Live  Demo  

13/05/15

10 2

Ques0ons?  

13/05/15

10 3

NWP Facing the Future with Cpp.pptx - GitHub

May 13, 2015 - Data sources for the ECMWF Meteorological Opera^onal System. 10 .... 13/05/15. 24. Big Data. Scalable Algorithms. Pla orm Uncertainty ...

25MB Sizes 1 Downloads 63 Views

Recommend Documents

Packer Jaccard Index Future Development Experimental ... - GitHub
Well-known AV signature. 328 byte length ... Moreover changing the encryption key produces a completely diffe- ... lowed by the encrypted virus body. Memorial.

NWP PSD report 17.11.14.pdf
17/11/14 Case Recorded. 171 1120 145326 Restricted 51 Page 1 of 1. Page 3 of 3. NWP PSD report 17.11.14.pdf. NWP PSD report 17.11.14.pdf. Open. Extract.

Java with Generators - GitHub
processes the control flow graph and transforms it into a state machine. This is required because we can then create states function SCOPEMANGLE(node).

Facing the facts
reaching a somewhat mature state, ought to support ... a group, and their suggestion gained support from .... The University of Chicago Press, Chicago. Boyd, R.

Re-envisioning a future in scholarly communication - GitHub
Aug 24, 2017 - Preregistration. Access. Materials. Data. Preservation. Replication. Reviews. Protocols etc. Duct tape solutions won t last.

with ZeroMQ and gevent - GitHub
Normally, the networking of distributed systems is ... Service Oriented .... while True: msg = socket.recv() print "Received", msg socket.send(msg). 1. 2. 3. 4. 5. 6. 7.

Getting Started with CodeXL - GitHub
10. Source Code View . ..... APU, a recent version of Radeon Software, and the OpenCL APP SDK. This document describes ...... lel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf. For GPU ... trademarks of their respective companies.

The Population Challenge Facing Bangladesh
Jun 18, 2011 - New York State College at Farmingdale. ABSTRACT ... would make it to the top of the list in population density. The country may ..... These themes must be brought to the ... www.banglapraxis.wordpress.com/2009/.../indias-.