Open Resilient Cluster Manager (ORCM)

Ralph H. Castain, Ph.D.

Objectives
  – Easily customized, extended
  – Replace/override any behavior
  – Support proprietary as well as open extensions
  – Fully u
• Establish ecosystem
  – Academic, industry collaborators, OMPI-like community
• Provide a reference solution
  – Publicly available, performant, flexible, scalable
  – Easily replace any pieces

2003 - present
[Figure: lineage timeline. OMPI/ORTE (Open MPI, OpenRTE; 10s of Knodes), Enterprise Router (Cisco), Cluster Monitor (EMC), SCON/RM (Intel), ORCM/SCON]

ORCM Roadmap
• Monitoring system (1Q2015)
  – System environment, power, process usage, etc.
• Overlay network/pub-sub (3Q2015)
  – More in a minute
• Job launch (3Q2015)
  – Can do it now, but need support for all MPIs
• Workload manager (2016)
  – Lightweight

Hierarchical Arch
[Figure: tree with SMC at the root, Rows beneath it, Racks beneath each Row, and compute nodes (CN) beneath each Rack]

Integration
• Network
  – QoS controls
  – Sta
• Power
  – Various modes, dynamic controls, site-level control

Instant On Steps
• Prestage executables to IO nodes
• Allocate and launch

Launch message => orte_job_t, included in allocation
• Distributed mapping/rolling start (branch)
  – Each daemon computes map, stores all data (map, endpoints, network topo) in shared memory region for job
  – Connect/accept => pass SM connection
• Eliminate modex
  – Sta
• Eliminate fence at end of mpi_init
  – Modex-recv becomes flag that proc is ready
  – RM flags all procs on node upon first request so subsequent checks are local
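The distributed-mapping bullet relies on every daemon deriving the same process map from the same launch message, so the map itself never has to be sent. A minimal Python sketch of that idea follows, assuming a simple fill-by-slot placement policy. The `compute_map` and `local_ranks` functions, the node names, and the slot counts are hypothetical and are not taken from ORTE or ORCM.

```python
from typing import Dict, List


def compute_map(nodes: Dict[str, int], num_procs: int) -> Dict[int, str]:
    """Place ranks on nodes by slot, filling each node before moving on.

    The result depends only on the inputs, so every daemon that receives
    the same launch message computes an identical map.
    """
    if num_procs > sum(nodes.values()):
        raise ValueError("not enough slots for the requested processes")
    placement: Dict[int, str] = {}
    rank = 0
    # Sort node names so iteration order cannot differ between daemons.
    for name in sorted(nodes):
        for _ in range(nodes[name]):
            if rank == num_procs:
                return placement
            placement[rank] = name
            rank += 1
    return placement


def local_ranks(placement: Dict[int, str], my_node: str) -> List[int]:
    """Ranks that the daemon on `my_node` must start locally."""
    return [r for r, n in sorted(placement.items()) if n == my_node]


# Hypothetical allocation: three nodes with two slots each.
allocation = {"node0": 2, "node1": 2, "node2": 2}
job_map = compute_map(allocation, num_procs=5)
print(local_ranks(job_map, "node1"))  # prints [2, 3]
```

Because the map is a pure function of the allocation and the process count, each daemon can start its local ranks as soon as the launch message arrives.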

Definition
• In-flight analytics
Requirements
• Scalable to exascale levels
  – Better-than-linear scaling of broadcast
• Resilient
  – Self-heal around failures
  – Reintegrate recovered resources
• Dynamically configurable
  – Sense and adapt, user-directable
  – On-the-fly updates
• Open source (non-GPL)
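One common way to get better-than-linear broadcast scaling is to relay the message through a binomial tree, so the number of rounds grows with log2(N) rather than N. The slides do not say which algorithm the overlay network uses, so the Python sketch below is only an illustration of the scaling property. The function names are hypothetical.

```python
from typing import List


def binomial_children(rank: int, size: int) -> List[int]:
    """Ranks that `rank` forwards to in a binomial tree rooted at rank 0."""
    step = 1
    # Skip the powers of two that lie at or below this rank.
    while step <= rank:
        step <<= 1
    children = []
    while rank + step < size:
        children.append(rank + step)
        step <<= 1
    return children


def broadcast_rounds(size: int) -> int:
    """Count the rounds needed until every rank holds the message."""
    have = {0}
    step = 1
    rounds = 0
    while len(have) < size:
        # Every rank that already has the message sends to one new peer.
        have |= {r + step for r in have if r + step < size}
        step <<= 1
        rounds += 1
    return rounds


print(binomial_children(0, 8))  # prints [1, 2, 4]
print(broadcast_rounds(1000))   # prints 10
```

With 1000 ranks the relay finishes in 10 rounds, whereas a root that sends to each peer in turn would need 999 sends.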

High-Level Architecture
[Figure: block diagram. SCON-MSG (RMQ, ZMQ, Send/Recv) over BTL (STL, Portals4, IB, UGNI, SM, USNIC, CUDA, TCP, SCIF) and OOB (TCP, UDP); SCON-ANALYTICS (FILTER, AVG, THRESHOLD) with Workflow/Pub-Sub]
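The FILTER, AVG, and THRESHOLD blocks under SCON-ANALYTICS suggest small stages that can be chained into a workflow. A minimal Python sketch of such a chain follows. The stage functions, the sample format, and the sensor readings are all hypothetical and only illustrate how chained stages could reduce a data stream to events.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Sample = Tuple[str, float]  # (sensor name, value)
Stage = Callable[[Iterable[Sample]], Iterator[Sample]]


def filter_stage(sensor: str) -> Stage:
    """Pass through only the samples from one sensor."""
    def run(samples: Iterable[Sample]) -> Iterator[Sample]:
        return (s for s in samples if s[0] == sensor)
    return run


def avg_stage(window: int) -> Stage:
    """Emit one averaged sample per `window` consecutive inputs."""
    def run(samples: Iterable[Sample]) -> Iterator[Sample]:
        buf: List[float] = []
        for name, value in samples:
            buf.append(value)
            if len(buf) == window:
                yield (name, sum(buf) / window)
                buf.clear()
    return run


def threshold_stage(limit: float) -> Stage:
    """Pass through only the samples whose value exceeds `limit`."""
    def run(samples: Iterable[Sample]) -> Iterator[Sample]:
        return (s for s in samples if s[1] > limit)
    return run


def run_workflow(stages: List[Stage], samples: Iterable[Sample]) -> List[Sample]:
    """Feed the samples through each stage in order."""
    stream: Iterable[Sample] = iter(samples)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


readings = [("temp", 60.0), ("power", 210.0), ("temp", 70.0),
            ("temp", 90.0), ("power", 190.0), ("temp", 94.0)]
events = run_workflow(
    [filter_stage("temp"), avg_stage(2), threshold_stage(80.0)], readings)
print(events)  # prints [('temp', 92.0)]
```

Each stage only consumes an iterator and yields an iterator, so stages can be reordered or replaced without touching the others.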

Messaging APIs
• Typical send/recv
  – Non-blocking, iovec or buffered (built-in heterogeneous support)
• Open channel
  – Specify remote peer and endpoint tag
  – Provide hints on type of data messaging to be used
    • Stream, command/control, etc.
  – Specify desired quality of service
    • Guaranteed delivery of every message, high priority, etc.
• Subscribe to data stream
  – Specify source and data

Message Routing
• Defined per transport (branch)
• Heals routes
  – Provides alternate route upon failure
  – Up-level error if no alternate available on this transport
  – Allows re-routing
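The route-healing behavior can be sketched as a loop that tries each route on a transport and raises an error to the caller only when none is left. This Python sketch is an illustration under assumptions: `NoRouteError`, `send_with_failover`, and the route names are hypothetical and do not come from ORCM or SCON.

```python
from typing import Callable, Sequence


class NoRouteError(Exception):
    """Raised to the caller when a transport has no working route left."""


def send_with_failover(routes: Sequence[str],
                       try_send: Callable[[str], bool]) -> str:
    """Try each route in order and return the first one that delivers."""
    for route in routes:
        if try_send(route):
            return route
    # No alternate is available on this transport: pass the error up a level.
    raise NoRouteError("no alternate route available on this transport")


# Hypothetical topology: the route through rack1 has failed.
failed = {"via-rack1"}
used = send_with_failover(["via-rack1", "via-rack2"],
                          lambda route: route not in failed)
print(used)  # prints via-rack2
```

The layer that catches `NoRouteError` is then free to decide what to do next, which keeps the per-transport logic simple.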
Message Reliability
• Plugin architecture
  – Selected per transport, requested quality of service
• ACK-based (cmd/ctrl)
  – Ack each message, or window of messages, based on QoS
  – Resend or return error
  – QoS specified policy and number of retries before giving up
• NACK-based (streaming)
  – Nack if message sequence number is out of order indica
• Mul
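The two reliability styles can be contrasted in a short Python sketch: an ACK-based sender that resends up to a retry limit, and a NACK-based receiver that reports gaps in the sequence numbers. All names here are hypothetical, and the `transmit` callback stands in for a real transport that returns True once an ack arrives.

```python
from typing import Callable, List


def send_with_ack(payload: bytes, transmit: Callable[[bytes], bool],
                  max_retries: int) -> int:
    """Send until acked. Returns the number of attempts used."""
    for attempt in range(1, max_retries + 2):  # first try plus the retries
        if transmit(payload):
            return attempt
    raise TimeoutError("no ack after %d retries" % max_retries)


class NackReceiver:
    """Tracks the next expected sequence number of a stream."""

    def __init__(self) -> None:
        self.expected = 0

    def receive(self, seq: int) -> List[int]:
        """Return the sequence numbers to NACK after seeing `seq`."""
        missing = list(range(self.expected, seq)) if seq > self.expected else []
        if seq >= self.expected:
            self.expected = seq + 1
        return missing


# ACK-based: the first two transmissions are lost, the third is acked.
outcomes = iter([False, False, True])
attempts = send_with_ack(b"cmd", lambda _: next(outcomes), max_retries=3)
print(attempts)  # prints 3

# NACK-based: messages 2 and 3 never arrive before message 4.
rx = NackReceiver()
nacks = [rx.receive(seq) for seq in (0, 1, 4, 5)]
print(nacks)  # prints [[], [], [2, 3], []]
```

The ACK path costs one acknowledgement per message or window, which suits command/control traffic. The NACK path sends nothing while the stream is in order, which suits streaming.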
Analytics
• Event generation
