Holistic Aggregate Resource Environment

Eric Van Hensbergen (IBM Research), Ron Minnich (Sandia National Labs), Jim McKie (Bell Labs), Charles Forsyth (Vita Nuova), David Eckhardt (CMU)

Overview

[Images: Sequoia, BG/L, Red Storm]

Research Topics

• Prerequisite: reliability and application-driven design is pervasive in all explored areas



• Offload/Acceleration Deployment Model
  – The supercomputer needs to become an extension of the scientist's desktop, as opposed to a batch-driven, non-standard run-time environment
• Aggregation
  – Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency
  – Distribute system services throughout the machine (not just on the I/O node)
• Interconnect Abstractions & Utilization
  – Leverage HPC interconnects in system services (file system, etc.)
  – Sockets & TCP/IP don't map well to HPC interconnects (torus and collective) and are inefficient when the hardware provides reliability

Right Weight Kernel

• General-purpose multi-threaded, multi-user environment
• Pleasantly Portable
• Relatively Lightweight (relative to Linux)
• Core Principles
  – All resources are synthetic file hierarchies
  – Local & remote resources are accessed via a simple API
  – Each thread can dynamically organize local and remote resources via a dynamic private namespace (see the sketch below)
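A minimal sketch (Plan 9 C, untested) of these principles in practice: a process detaches its namespace, mounts a remote 9P file server, and binds part of it over a local directory. The server name "fileserver" and the paths are hypothetical stand-ins, not part of the HARE design itself.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        int fd;

        /* copy the namespace so subsequent changes stay private */
        if(rfork(RFNAMEG) < 0)
            sysfatal("rfork: %r");

        /* dial a remote file server and mount its 9P service */
        fd = dial(netmkaddr("fileserver", "tcp", "564"), nil, nil, nil);
        if(fd < 0)
            sysfatal("dial: %r");
        if(mount(fd, -1, "/n/remote", MREPL, "") < 0)
            sysfatal("mount: %r");

        /* union the remote binaries after the local ones in /bin */
        if(bind("/n/remote/power/bin", "/bin", MAFTER) < 0)
            sysfatal("bind: %r");

        exits(nil);
    }

Because every resource is a file tree, the same two calls (mount and bind) compose local and remote resources alike; no resource-specific API is needed.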

Aggregation

• Extend the BG/P aggregation model beyond the I/O and CPU node barrier
• Allow grouping of nodes into collaborating aggregates with distributed system services and dedicated service nodes
• Allow specialized kernels for file service, monitoring, checkpointing, and network routing
• Parameterized redundancy, reliability, and scaling
• Allow dynamic (re-)organization of the programming model to match the (changing) workload

[Diagram: a local service and a proxy service in front of an aggregate service composed of remote services]

Topology

Desktop Extension

• Users want supercomputers to be an extension of their desktop
• The current parallel model is the traditional batch model
  – Workloads must use specialized compilers and be scheduled from a special front-end node; results are collected into a separate file system
  – Monitoring and job control happen through a web interface or the MMCS command line
  – The difficult development environment and lack of interactivity limit the productivity of the execution environment
• Proposed Research
  – Leverage library-OS commercial scale-out work to allow tighter coupling between the desktop environment and supercomputer resources
  – Construct a runtime environment that includes a reasonable subset of support for typical Linux run-time requirements (glibc, Python, etc.)

Extension Example

[Diagram: apps on a Mac (OS X, via ssh) and a Linux pSeries system run brasil, which connects over the internet and 10GB Ethernet to Plan 9 I/O nodes, and through the collective and torus networks to Plan 9 CPU nodes]

Native Interconnects

• Blue Gene's specialized networks are used primarily by the user-space run-time
  – The hardware is accessed directly by the user-space runtime environment and is not shared, leading to poor utilization
  – Exclusive use of the tree network for I/O limits bandwidth and reliability
• Proposed solution
  – Lightweight system software interfaces to the interconnects, so that they can be leveraged for system management, monitoring, and resource sharing as well as by user applications

Protocol Exploration

• The Blue Gene networks are unusual (e.g., a 3D torus carrying 240-byte payloads)
  – IP works, but isn't well matched to the underlying capabilities
  – We want an efficient transport protocol to carry 9P messages & other data streams
• Related work: IBM's "one-sided" messaging operations [Blocksome et al.]
  – It supports both MPI and non-MPI applications such as Global Arrays
  – Inspired by the IBM messaging protocol, we think we can do better than just IP
• Years ago there was much work on lightweight protocols for high-speed networks
  – We are using ideas from that earlier research to implement an efficient protocol to carry 9P conversations

Project Roadmap

[Chart: three-year roadmap starting in 2009 — Hardware Support, then Systems Infrastructure, then Evaluation, Scaling, & Tuning]

Milestones (Year 1)

[Chart, weeks 0–50: BASIC — Initial Boot, 10GB Ethernet, Collective Network, Initial Infrastructure, SMP Support, Large Pages, Torus Network, Native Protocol; BASELINE — baseline against BG/L (circa 2007)]

PUSH

[Figure 1: The structure of the PUSH shell — shell command stages connected by fifos through a multiplexor and demultiplexor]

    push -c '{
        ORS=./blm.dis du -an files |< xargs os chasen | awk '{print \$1}' | sort | uniq -c >| sort -rn
    }'

We have added two additional pipeline operators: a multiplexing fan-out (|<[n]) and a coalescing fan-in (>|). This combination allows PUSH to distribute I/O to and from multiple simultaneous threads of control. The fan-out argument n specifies the desired degree of parallel threading. If no argument is specified, the default of spawning a new thread per record (up to the limit of available cores) is used. This can also be overridden by command-line options or environment variables. The pipeline operators provide implicit grouping semantics, allowing natural nesting and composability. While their complementary nature usually leads to symmetric mappings (where the number of fan-outs equals the number of fan-ins), nothing in our implementation enforces that.

Early FTQ Results

[Plot: FTQ benchmark results]

Strid3

[Plot: time in seconds for 1024 iterations of Y = AX + Y vs. "stride", i.e. the distance between scalars]

Application Support

• Native
• Inferno Virtual Machine
• CNK Binary Support
  – ELF converter
  – Extended proc interface to mark processes as "CNK procs"
  – Transition once the process execs, and not before
  – Shim in the syscall trap code to adapt argument-passing conventions
• Linux Binary Support
  – Basic Linux binary support
  – Functional enough to run basic programs (Python, etc.)

Publications

• Unified Execution Model for Cloud Computing; Eric Van Hensbergen, Noah Evans, Phillip Stanley-Marbell. Submitted to LADIS 2009; October 2009.
• PUSH, a DISC Shell; Eric Van Hensbergen, Noah Evans. To appear in the Proceedings of the Principles of Distributed Computing Conference; August 2009.
• Measuring Kernel Throughput on BG/P with the Plan 9 Research Operating System; Ron Minnich, John Floren, Aki Nyrhinen. Submitted to SC09; November 2009.
• XCPU2: Distributed Seamless Desktop Extension; Eric Van Hensbergen, Latchesar Ionkov. Submitted to IEEE Clusters 2009; October 2009.
• Service Oriented File Systems; Eric Van Hensbergen, Noah Evans, Phillip Stanley-Marbell. IBM Research Report RC24788; June 2009.
• Experiences Porting the Plan 9 Research Operating System to the IBM Blue Gene Supercomputers; Ron Minnich, Jim McKie. To appear in the Proceedings of the International Conference on Supercomputing (ISC); June 2009.
• System Support for Many Task Computing; Eric Van Hensbergen, Ron Minnich. In the Proceedings of the Workshop on Many Task Computing on Grids and Supercomputers; November 2008.
• Holistic Aggregate Resource Environment; Charles Forsyth, Jim McKie, Ron Minnich, Eric Van Hensbergen. In ACM Operating Systems Review; January 2008.
• Night of the Lepus: A Plan 9 Perspective on Blue Gene's Interconnects; Charles Forsyth, Jim McKie, Ron Minnich, Eric Van Hensbergen. In the Proceedings of the Second Annual International Workshop on Plan 9; December 2007.
• Petascale Plan 9; USENIX 2007.

Next Steps

• Infrastructure Scale-Out
  – File services
  – Command execution
  – Alternate internode communication models
  – Fail-in-place software RAS models
• Applications (Linux binaries and native support)
  – Large-scale LINPACK run
  – Explore the Mantevo application suite (http://software.sandia.gov/mantevo)
  – CMU working on a native Quake port

Acknowledgments

• Computational resources provided by the DOE INCITE Program. Thanks to the patient folks at ANL who have supported us in bringing up Plan 9 on their development BG/P.
• Thanks to the IBM Research Blue Gene team and the Kittyhawk team for guidance and support.

Questions? Discussion? http://www.research.ibm.com/hare

Backup


Plan 9 Characteristics

Kernel breakdown (lines of code):
• Architecture-specific code (BG/P): ~14,000
• Portable code (port): ~25,000
• TCP/IP stack: ~14,000

Binary size: 415k text + 140k data + 107k BSS


Why not Linux?

• Not a distributed system
• Core systems inflexible
  – VM based on the x86 MMU
  – Networking tightly tied to sockets & TCP/IP, with a long call path
• Typical installations extremely overweight and noisy
• Benefits of modularity and open source are overcome by complexity, dependencies, and a rapid rate of change
• Community has become conservative
  – Support for alternative interfaces is waning
  – Support for large systems that hurts small systems is not acceptable
• Ultimately a customer constraint
  – FastOS was created to prevent an OS monoculture in HPC
  – Few Linux projects were even invited to submit final proposals


Everything Represented as File Systems

• Hardware devices: disk (/dev/hda1, /dev/hda2), network (/dev/eth0), console, audio, etc.
• System services: the TCP/IP stack exports /net (/net/arp, /net/udp, /net/tcp with clone and stats files and per-connection directories /0, /1, /2, ... each holding ctl, data, listen, local, remote, and status); DNS exports /net/cs and /net/dns; the GUI exports /win (clone and per-window directories with ctl, data, and refresh); process control, debug, etc.
• Application services: wiki, authentication, and service control
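The /net hierarchy makes the network itself scriptable: connections are made by reading and writing ordinary files. A minimal sketch (Plan 9 C, untested) of dialing a TCP connection by hand through /net, the sequence that the library dial call wraps; the address used is an arbitrary example:

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char dir[40], path[64];
        int ctlfd, datafd, n;

        /* opening clone allocates a fresh connection directory;
           reading it returns the connection number */
        ctlfd = open("/net/tcp/clone", ORDWR);
        if(ctlfd < 0)
            sysfatal("clone: %r");
        n = read(ctlfd, dir, sizeof dir - 1);
        if(n <= 0)
            sysfatal("read clone: %r");
        dir[n] = '\0';

        /* writing to ctl establishes the connection */
        if(fprint(ctlfd, "connect 192.168.0.1!564") < 0)
            sysfatal("connect: %r");

        /* the data file now carries the byte stream */
        snprint(path, sizeof path, "/net/tcp/%s/data", dir);
        datafd = open(path, ORDWR);
        if(datafd < 0)
            sysfatal("data: %r");
        write(datafd, "hello", 5);
        close(datafd);
        close(ctlfd);
        exits(nil);
    }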


Plan 9 Networks

[Diagram: terminals (screen phones, set-top boxes, PDAs, smartphones) connect over cable/DSL and WiFi/Edge through the internet to a 1GB/s LAN of terminals, with CPU servers, a network file server, and content-addressable storage on a high-bandwidth (10GB/s) network]


Aggregation as a First Class Concept

[Diagram: a local service and a proxy service in front of an aggregate service composed of multiple remote services]

Issues of Topology


File Cache Example

• Proxy service monitors access to the remote file server & local resources
  – Local cache mode
  – Collaborative cache mode
  – Designated cache server(s)
• Integrate replication and redundancy
• Explore write coherence via "territories" à la Envoy
• Based on experiences with the Xget deployment model
• Leverage the natural topology of the machine where possible


Monitoring Example

• Distribute monitoring throughout the system
  – Use for system health monitoring and load balancing
  – Allow for application-specific monitoring agents
• Distribute filtering & control agents at key points in the topology
  – Allows localized monitoring and control as well as high-level global reporting and control
• Explore both push and pull methods of modeling
• Based on experiences with the supermon system (see the sketch below)
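Since monitors are just file servers, a consumer needs nothing more than read. A minimal sketch (Plan 9 C, untested) that polls an aggregated statistics file; the path /mnt/mon/stats and its contents are hypothetical stand-ins for whatever synthetic file a supermon-style service would export:

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char buf[8192];
        int fd, n;

        fd = open("/mnt/mon/stats", OREAD);
        if(fd < 0)
            sysfatal("open: %r");
        for(;;){
            /* each read returns a fresh snapshot aggregated by the proxy */
            if(seek(fd, 0, 0) < 0)
                sysfatal("seek: %r");
            n = read(fd, buf, sizeof buf - 1);
            if(n <= 0)
                break;
            buf[n] = '\0';
            print("%s", buf);
            sleep(1000);    /* poll once a second */
        }
        exits(nil);
    }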


Workload Management Example

• Provide a file system interface to job execution and scheduling (see the sketch below)
• Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls
• Can allow more organic growth of workloads as well as top-down and bottom-up models
• Can be extended to allow direct access from end-user workstations
• Based on experiences with the Xcpu mechanism


Right Weight Kernels Project (Phase I)

• Motivation: OS effect on applications
  – Metric is based on OS interference on the FWQ & FTQ benchmarks (see the sketch below)
  – AIX/Linux has more capability than many apps need
  – LWK and CNK have less capability than apps want
• Approach: customize the kernel to the application
• Ongoing challenge: balancing capability with overhead
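The FTQ (Fixed Time Quantum) benchmark measures OS interference by counting how much trivial work fits into each of a series of fixed-length quanta; dips in the counts expose time stolen by the OS. A minimal sketch of the idea (Plan 9 C, untested; quantum length and sample count are arbitrary), not the actual benchmark code:

    #include <u.h>
    #include <libc.h>

    enum { Quantum = 1000000, Samples = 1000 };  /* 1ms quanta, 1000 samples */

    void
    main(void)
    {
        vlong start, end;
        long count[Samples];
        int i;

        for(i = 0; i < Samples; i++){
            start = nsec();
            end = start + Quantum;
            count[i] = 0;
            /* count completed units of work until the quantum expires */
            while(nsec() < end)
                count[i]++;
        }
        for(i = 0; i < Samples; i++)
            print("%d %ld\n", i, count[i]);
        exits(nil);
    }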


Why Blue Gene?

• Readily available large-scale cluster
  – Minimum allocation is 37 nodes
  – Easy to get 512 and 1024 node configurations
  – Up to 8192 nodes available upon request internally
  – FastOS will make a 64k configuration available
• DOE interest: Blue Gene was a specified target
• Variety of interconnects allows exploration of alternatives
• Embedded-core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems software management, device drivers, or firmware

