CARNEGIE MELLON UNIVERSITY
P-Reg: A Performance Regression and Debugging Tool for Parallel NFS
By Zainul Abedin Abbasi
[email protected]
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science In Information Networking
INFORMATION NETWORKING INSTITUTE
Pittsburgh, PA USA May 12, 2006
Acknowledgements

I thank my thesis advisors, Professor Gregory Ganger and Professor Garth Gibson, for the opportunity to work on such ambitious projects and for their constant optimism.

I would like to thank Swaroop Choudhari for the tremendous effort he put forth to help develop this tool.

I would like to thank Dean Hildebrand, Rob Ross, Rob Latham and Sam Lang for their time and comments. Special thanks to my family for their unwavering support. Without their help, I could not have come this far.
Table of Contents

Abstract
1  Introduction
2  Methodology
3  Architecture
   3.1  Context Diagram
   3.2  Runtime/Component and Connector (CNC) view
   3.3  Deployment/Physical view Diagram
4  Implementation
5  Example Run
   5.1  Customization Wizard
   5.2  Graph-Panel
6  Example Problem
   6.1  Experimental Setup
   6.2  Environment check (Stage 0) test results
   6.3  Regression Run Results
   6.4  Net Saturation Tool (netSat)
7  Future Work
8  Summary and Conclusion
References
List of Figures

Figure 1: NFS Architecture
Figure 2: pNFS Architecture [15]
Figure 3: Reference Implementation
Figure 4: P-Reg Context Diagram
Figure 5: CNC view
Figure 6: Deployment view
Figure 7: Customization Window
Figure 8: Sample Config File
Figure 9: Graph Panel Window
Figure 10: Sample output
Figure 11: Experimental Setup
Figure 12: Throughput vs Record size for file size 4MB, Stripe size 64KB
Figure 13: Throughput vs Record size for file size 512MB, Stripe size 64KB
Figure 14: TCP sequence number vs Time
Figure 15: netSat Output - Throughput vs Number of Servers
Abstract

Parallel NFS improves file access performance and scalability by providing multiple clients with support for direct storage access. It is fairly common to use iterative development to build such systems. Hence, a performance regression suite and performance debugging tool would be helpful to uncover performance issues in new code and to track performance across versions. Additionally, such a tool would simplify experimentation for proving hypotheses and aid understanding of results through summarized outputs. Since parallel file systems like pNFS are distributed in nature, the problem of tracking performance and debugging performance issues is non-trivial. Existing performance regression and debugging tools vary across developer communities, making it difficult to compare results. This report describes a new performance regression suite and performance debugging tool for the pNFS reference implementation being developed at the University of Michigan. We aim to create a single tool, available to all pNFS and other file system developers, that consolidates and builds on knowledge from various tools across developer communities.
1 Introduction

Traditionally, shared access to a set of storage resources is achieved via distributed file systems such as NFS (Figure 1).
Figure 1: NFS Architecture
The basic NFS architecture is shown in figure 1. In this model, all client requests are serviced by the central NFS server. This model is adequate for a limited number of clients with modest data requirements. However, in the face of a large number of data-hungry clients, the central NFS server tends to become a performance bottleneck.
With the advent of storage networking, storage nodes became first class network entities, making it possible for clients to access storage nodes directly. This led to out of band data access, which alleviates this bottleneck and allows the aggregate bandwidth of the storage system to scale with the number of storage nodes. It achieves this by separating the control and data flows. This separation provides a straightforward framework to accomplish high scalability, by allowing transfers of data to proceed in parallel directly from/to many clients
from/to many storage node endpoints. Control and file management operations, inherently more difficult to parallelize, can remain the province of a centralized file manager, inheriting the simple management of today's NFS file service.
Figure 2: pNFS Architecture [15] pNFS extends NFSv4 by adding a layout driver, an I/O driver, and a file layout retrieval interface. The pNFS server creates a file layout map and transfers it to the pNFS client, which delivers it to its layout driver for direct and parallel data access. The pNFS client uses the layout driver for all I/O.
Parallel NFS (Figure 2), which seeks to provide a standardized interface for out of band data access, aims to solve the scalability problem while being an open standard. Parallel NFS allows parallel access to data by returning to the client a layout or map of where data is in the file system, instead of doing data access on behalf of the client. Data transfer to/from storage nodes may be done using NFS or other protocols, such as iSCSI or OSD, under the control of an NFSv4 server with parallel NFS extensions. Such an approach protects the industry's large investment in NFS, since the bandwidth bottleneck no longer needs to drive users to adopt a proprietary alternative solution, and leverages SAN storage infrastructures, all within a common architectural framework.
Parallel NFS is part of the IETF NFSv4 minor version 1 standard draft. The companies involved in its development include Network Appliance, EMC, IBM, Sun and Panasas. University organizations involved include Center for Information Technology Integration (CITI) at the University of Michigan and Parallel Data Laboratory (PDL) at Carnegie Mellon University. The developers use iterative development to build pNFS systems, and hence can benefit from a performance regression suite to uncover performance issues in new code and help track performance across versions. A performance debugging tool can also help analyze and understand the performance problems exposed by the regression suite. The problem of tracking performance and debugging performance issues in a parallel distributed file system is nontrivial. Currently, the tools used vary across developer communities, making it difficult to compare results. This report describes a new performance regression suite and performance debugging tool, called P-Reg, that is now available to all pNFS developers.
The remainder of the report is organized as follows. Section 2 discusses methodology used in building this tool. Sections 3 and 4 describe the design, architecture and implementation of the tool. Section 5 describes how to use the tool. Section 6 discusses an example problem. Section 7 discusses future work. Section 8 summarizes and concludes.
2 Methodology

In building P-Reg we treated the pNFS system as a black box. The CITI reference implementation (Figure 3) was used as a testbed.
Figure 3: Reference Implementation

In the figure above, the red lines show where the strace tool was used to trace the interaction between modules, and the pink lines show where tcpdump was used to trace calls between modules made over the network. The client and server color coding indicates that the modules reside on physically separate machines.
A feedback process was used to build the tool, with the following steps:

1. Performance tests were run on the testbed using P-Reg.
2. Results were understood by running additional tests and using external tools.
3. The tests and tools that helped us understand the issues were added to P-Reg.
3 Architecture

One definition of software architecture, given by [14], is "the structure of the components of a program/system, their interrelationships, and principles and guidelines governing their design and evolution over time". This definition is rather abstract, so to give a clearer picture of the architecture of our system we will go through three different diagrams:

• Context Diagram
• Runtime/Component and Connector (CNC) view diagram
• Deployment/Physical view diagram
3.1 Context Diagram

The context diagram of any system is a top-level requirements diagram that documents the external environment of a black box system [14].
Figure 4: P-Reg Context Diagram

The context diagram for P-Reg is shown in figure 4. This diagram shows how P-Reg fits into the entire system. As illustrated, a user interacts with P-Reg, which in turn runs tests, specified by the user, on the pNFS system. Once the tests are complete, P-Reg collects the generated data, processes it, and produces summarized outputs for the user.
3.2 Runtime/Component and Connector (CNC) view

The CNC view diagram of any system documents the different components that share responsibility for the entire system and the different types of connections that are permissible between these components.
Figure 5: CNC view

The Runtime/CNC view diagram of P-Reg is shown in figure 5. This diagram shows the different components in the system and how they are related to each other. The responsibilities of the system are shared among these components or processes. The different components and their responsibilities are listed in table 1.
Component                              Responsibilities
Graphical User Interface (GUI)         Interact with the user. Provide summarized outputs to the user. Invoke the core to run the specified tests.
Core                                   Invoke the Performance Test Runner to run the specified tests. Save results and relevant information about tests in the database. Process information and generate summarized outputs.
Performance Test Runner                Run the specified tests.
MySQL Client                           Provide the core with an API to connect to and manipulate data in the MySQL Server.
MySQL Server                           Database server to save results and other information about tests.
External tools: hdparm, Iperf,
NetSaturation (netSat)                 Tools used to verify the environment setup.
IOzone                                 File system benchmark used to run performance tests on the pNFS system.
tcpdump, sysstat                       Tools used to trace network and disk activity.
Gnuplot, tcptrace, ethereal            Tools used to generate summarized outputs.

[Table 1: P-Reg Components and responsibilities]
3.3 Deployment/Physical view Diagram

The Deployment view diagram (figure 6) documents where each component defined in the CNC diagram resides.
Figure 6: Deployment view

The table below lists the processes running on the different machines.

Machine                        Components
Host machine (user machine)    Core, GUI, MySQL, Performance Test Runner, gnuplot, tcptrace and ethereal
pNFS system clients            IOzone, tcpdump, iperf, netSaturation (netSat), strace
NFS server                     iperf
Storage/Data servers           hdparm, iperf, netSaturation (netSat), tcpdump

[Table 2: P-Reg Deployment of components]
4 Implementation

The P-Reg tool executes in 3 stages:

1. Stage 0 - Environment check stage. Here, the user can run tests to exercise different components of the pNFS system separately, for example, the network bandwidth between any pair of machines or the disk bandwidth of disks on storage servers.
2. Stage 1 - Regression Run stage. In this stage, the user can choose and run different performance tests.
3. Stage 2 - Process and generate. In this stage, the user can view summarized outputs of different runs and also compare different runs.

The tool is built on a Linux platform. Implementation details of the tool are summarized in table 3.

Component                         Tool used                                            Lines of Code
Graphical User Interface          Java Swing (ver 1.5.1)                               4K
Performance Test Runner           BASH                                                 1.5K
Database                          MySQL Server; MySQL C API (C interface to core);
                                  JDBC SQL connector (Java interface to core)
Core                              C and Java 1.5 and above                             2.5K
Plotting and summarized outputs   Gnuplot                                              2K

[Table 3: P-Reg Implementation details]
The different tools used to build P-Reg are:

• IOzone [1] - A file system benchmark tool that generates and measures various file operations.
• Iperf [2] - A tool to measure network performance.
• Hdparm [3] - A tool used to get/set ATA/SATA disk parameters.
• Tcpdump [4] - A tool used to collect network traffic.
• Ethereal/Wireshark [5] - A network protocol analyzer.
• Tcptrace [6] - A tool used to analyze TCP traffic.
• Sysstat [7] - A package of utilities used to monitor system performance and usage activity.
• Strace [8] - A tool to trace the system calls and signals of a user process.
• Distributed shell or dancer's shell (dsh) [9] - A tool used to execute multiple remote shell commands.
• Gnuplot [10] - A data and function plotting utility.
5 Example Run

In this section, we walk through an example to show how a user can use the tool. A detailed description of how to use the tool can be found in the user manual available at https://bum.pdl.cmu.edu/~pubtwiki/cgi-bin/view/PNFS/User_manul

On starting the tool, the user is provided with a GUI that allows him to either run a new performance regression using the customization wizard window (Figure 7) or view summarized outputs of previous runs using the Graph-Panel window (Figure 9).
5.1 Customization Wizard
Figure 7: Customization Window

The configuration wizard can be used to configure the regression test that you want to run. The selections made in the configuration wizard define which tests are run and which metrics are collected for a particular test. P-Reg executes in 3 stages: the environment check/diagnostic stage, the regression stage, and the analysis/results stage. Using the configuration wizard, you can define the first two stages of P-Reg.
To run a regression, the user must specify an XML-format configuration file that describes the setup used for the test. The user specifies information about the clients, data servers and NFS server in this file. A sample configuration file is shown in Figure 8; a detailed description of how to create a configuration file is provided in the user manual.

Figure 8: Sample Config File
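The real schema of the configuration file is documented in the user manual; as a rough illustration of the idea, the sketch below parses a hypothetical config. The element and attribute names (preg-config, clients, nfs-server, data-servers, host) are assumptions for illustration only, not the actual P-Reg schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical P-Reg configuration -- element names are illustrative only;
# the real schema is described in the user manual.
SAMPLE_CONFIG = """
<preg-config>
  <clients>
    <client host="cc01"/>
    <client host="cc02"/>
  </clients>
  <nfs-server host="cc05"/>
  <data-servers>
    <server host="cc06"/>
    <server host="cc08"/>
  </data-servers>
</preg-config>
"""

def parse_config(text):
    """Extract the machine roles from a config document."""
    root = ET.fromstring(text)
    return {
        "clients": [c.get("host") for c in root.find("clients")],
        "nfs_server": root.find("nfs-server").get("host"),
        "data_servers": [s.get("host") for s in root.find("data-servers")],
    }

print(parse_config(SAMPLE_CONFIG))
```

A reader's takeaway: the file names every machine and its role, which is exactly the information the core needs to drive dsh, iperf and IOzone on the right hosts.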
Once the user specifies the required information, he can click "Run" to initiate the test. On doing so, the system performs the following steps:

1. The GUI takes the inputs from the user and forwards them to the core process.
2. The core saves information about the run in the database and, depending on the inputs, invokes the Performance Test Runner to run the specified tests.
3. After the Performance Test Runner completes the tests, the core extracts the data and saves it in the database.
4. On completion, control is returned to the user, who can then either run another test or view the results.
5.2 Graph-Panel
Figure 9: Graph Panel Window

Through this panel, the user can view summarized outputs of a run or compare the outputs of two different runs by checking the appropriate option. On clicking "Go", the user can choose the summarized output he wishes to view. The outputs available depend on the tests that were selected while customizing the run.
A user can view summarized outputs of different runs, or compare two different runs, by using the Graph-Panel window. A sample summarized output is shown below.

Figure 10: Sample output

This output shows a comparison of throughput vs record size for two different runs. Note that this is just a sample output and not actual values from a pNFS system.
6 Example Problem

This section describes a performance problem we encountered during development of P-Reg, to illustrate how the tool can be used to analyze and understand performance issues.
6.1 Experimental Setup

The experimental setup of the system is shown in figure 11.
Figure 11: Experimental Setup

The resources used for the setup are:

• 9 machines running a Linux 2.6.16 kernel with NFSv4.1 support (available from CITI). Each node has a 2.4 GHz Intel Pentium 4, 1 GB DDR RAM, and a single WDC ATA disk.
• Gigabit Ethernet connectivity through a single switch (HP ProCurve 2848).
• IOzone version 3.263 - file system benchmarking tool.
• Hdparm - tool to get information about ATA/SATA disks.
• Iperf - benchmark to measure the performance of the network.
The usage of the 9 machines is as shown below:

Machine Role           Number of Machines    PVFS2 Storage Available
Client                 4                     Not Applicable
Meta Data Server       1                     5GB
Storage/Data Server    4                     5GB

The system is set up with the PVFS2 default stripe size of 64KB.
6.2 Environment check (Stage 0) test results

Disk performance test

The performance of the disks on the data servers is shown below (average of 3 runs).

Data Server    Timing Buffered Disk Read    Read Performance
Storage 1      164 MB in 3.03 seconds       53.83 MB/sec
Storage 2      164 MB in 3.01 seconds       54.48 MB/sec
Storage 3      162 MB in 3.01 seconds       53.82 MB/sec
Storage 4      148 MB in 3.02 seconds       49.08 MB/sec
Network performance test

The link bandwidth between each pair of machines (cc01 through cc10) was measured with Iperf. All measured pairwise bandwidths fell between roughly 936 and 945 Mbits/s, i.e., every link runs close to the Gigabit Ethernet line rate. (The full matrix of per-pair measurements is omitted here.)

Looking at the results of the above two tests, we may conclude that the environment setup should allow clients to collectively achieve up to 200MB/s of total bandwidth.
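The ~200MB/s figure follows directly from the Stage 0 measurements above; as a quick sanity check of the arithmetic (a sketch, using the hdparm disk rates and the measured per-link Iperf rate):

```python
# Read rates measured by hdparm on the four storage servers (MB/s),
# from the Stage 0 disk test above.
disk_rates = [53.83, 54.48, 53.82, 49.08]

# Per-link Gigabit Ethernet bandwidth, converted to MB/s
# (Iperf measured roughly 937 Mbits/s per link).
link_rate = 937.0 / 8

aggregate_disk = sum(disk_rates)
print(f"aggregate disk bandwidth:   {aggregate_disk:.0f} MB/s")
print(f"per-link network bandwidth: {link_rate:.0f} MB/s")

# With four data servers, the disks (not the four server links) are the
# expected bottleneck: ~211 MB/s of aggregate disk bandwidth vs
# ~468 MB/s across four server links, so clients collectively should be
# able to approach roughly 200 MB/s.
```

Note that any single client is further capped by its own Gigabit link at about 117 MB/s, so reaching the aggregate requires multiple clients.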
6.3 Regression Run Results

Read performance of the system for a single client with varying record size is shown in Figure 12. For record sizes below 64KB, performance increases linearly with record size, and the numbers are in line with expectations. After 64KB, however, performance is erratic. On varying the number of servers from 1 to 3, we see the results shown in figures 13a-c. Looking at these graphs, we see that as the number of servers increases, the read performance of the system for syscall/record sizes > 64KB decreases. What could be the reason for this behavior?
Figure 12: Throughput vs Record size for file size 4MB, Stripe size 64KB. Single client - 4 data servers.

The graph above shows the result of the same test run 3 times.
Figure 13: Throughput vs Record size for file size 512MB, Stripe size 64KB

• Figure 13a: Single Client - 1 MDS - 1 Storage Node
• Figure 13b: Single Client - 1 MDS - 2 Storage Nodes
• Figure 13c: Single Client - 1 MDS - 3 Storage Nodes

The outputs shown in each graph are an average of 3 runs.
Since the stripe size of the system is 64KB, accesses of size <= 64KB go to only one data server. In this case, the results follow those of figure 12. But for access sizes > 64KB, requests go to more than one data server, and this is what leads to the decreased performance and erratic behavior. One might suspect a problem in the pNFS-over-PVFS2 code responsible for accessing data that is split between different servers. However, analyzing the problem further and looking at the TCP sequence number vs time graph for the run (figure 14), we see that the problem lies in the network: TCP is not going as fast as it should. Yet the setup tests that we ran at the beginning of the regression did not suggest any problem in the network. The reason is that Iperf does not generate the kind of traffic produced by a parallel file system.
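The relationship between access size and the number of data servers contacted can be sketched as follows, assuming simple round-robin striping (a simplification of PVFS2's default distribution, for illustration):

```python
STRIPE_SIZE = 64 * 1024  # 64KB, the PVFS2 default stripe size

def servers_touched(offset, length, num_servers, stripe_size=STRIPE_SIZE):
    """Return the set of data servers a read of `length` bytes at byte
    `offset` must contact, under round-robin striping."""
    first_unit = offset // stripe_size
    last_unit = (offset + length - 1) // stripe_size
    return {unit % num_servers for unit in range(first_unit, last_unit + 1)}

# An aligned 64KB record touches a single server...
print(servers_touched(0, 64 * 1024, num_servers=4))    # {0}
# ...while a 256KB record must gather stripe units from all four.
print(servers_touched(0, 256 * 1024, num_servers=4))   # {0, 1, 2, 3}
```

This is why the >64KB record sizes in figure 13 exercise the multi-server read path, while the smaller records do not.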
Figure 14: TCP sequence number vs Time

The graph shows the TCP sequence number against time for a TCP stream between one client and one data server. The trace was captured while measuring the read throughput of a 512MB file, with record size > 64KB, by a single client from 4 data servers.
6.4 Net Saturation Tool (netSat)

Based on the results discussed in the previous sections, we developed a network saturation tool (netSat) that simulates traffic similar to pNFS to test the network. The netSat client synchronizes the traffic coming from all data servers by asking each data server for a certain chunk of data and then waiting until all servers have responded before asking for the next chunk. A detailed description of the netSat tool is available in the user manual.
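The synchronized request pattern described above might be sketched as follows. This is an in-process stand-in, not the actual netSat implementation: the real tool moves data over TCP sockets, whereas fetch_chunk here is a hypothetical placeholder for a network read.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(server, chunk_id, chunk_size=64 * 1024):
    """Stand-in for a network read of one chunk from one data server;
    the real tool would recv() the bytes from a socket."""
    return bytes(chunk_size)

def synchronized_read(servers, num_chunks):
    """netSat-style client loop: one round per chunk, where each round
    waits for *all* servers before the next request is issued."""
    received = 0
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        for chunk_id in range(num_chunks):
            # Issue the request to every server in parallel...
            futures = [pool.submit(fetch_chunk, s, chunk_id) for s in servers]
            # ...then barrier: no server is asked for chunk_id + 1
            # until all of them have answered for chunk_id.
            for f in futures:
                received += len(f.result())
    return received

total = synchronized_read(["ds1", "ds2", "ds3", "ds4"], num_chunks=8)
print(total)  # 2097152 bytes = 4 servers x 8 chunks x 64KB
```

The barrier after every round is the essential difference from Iperf: it makes all servers transmit their responses into the client's link at the same instant, reproducing the bursty many-to-one pattern of striped reads.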
Figure 15: netSat Output - Throughput vs Number of Servers

The graph above compares the throughput available to 3 different clients as the number of data servers increases. Two clients (cc01 and cc06), represented in pink and orange, are part of the experimental setup shown in figure 11, while the client labeled platinum, represented in blue, is outside the original experimental setup. The same data servers were used for all 3 clients. The numbers in the graph are an average of 3 runs.
Figure 15 shows the output of running this tool on our setup and also on a client machine outside our original setup. In our original setup, lines labeled cc06 (pink) and cc01 (orange), we see that as the number of data servers increases, the throughput obtained from the network decreases from ~110MB/s to ~10MB/s, which mirrors the results we got for the pNFS system. However, for the machine outside the original network, line labeled platinum (blue), the throughput for a single server is ~70MB/s and remains constant as data servers are added. This test shows that the switch connecting the machines in our test environment does not provide adequate support for striped storage's incast (many-to-one) traffic pattern [16], which causes the behavior seen in the previous section. The netSat test is part of the regression suite and can be run as part of the environment check tests in Stage 0.
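A back-of-envelope model gives a feel for why incast hurts: when all data servers answer a synchronized request at once, the burst converging on the client's single switch port can exceed the per-port buffering, and the resulting drops stall TCP. The 128KB buffer below is purely illustrative, not a measured figure for this switch.

```python
STRIPE_UNIT = 64 * 1024    # bytes each server sends per synchronized round
PORT_BUFFER = 128 * 1024   # hypothetical per-port switch egress buffer

def burst_overflows(num_servers, stripe_unit=STRIPE_UNIT, buf=PORT_BUFFER):
    """True if a synchronized response burst exceeds the egress buffer
    on the client's switch port (a crude model of incast)."""
    return num_servers * stripe_unit > buf

for n in range(1, 5):
    print(n, "servers:", "overflow" if burst_overflows(n) else "fits")
```

Under this toy model, adding servers past the buffer's capacity only makes things worse, which matches the downward trend of the cc01/cc06 lines in figure 15.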
7 Future Work

We have developed the first version of a performance regression and debugging tool, but we see considerable room for enhancements and improvements. As future work, we would like to continue using current issues in file systems as test drivers for performance debugging, add more performance benchmarks to the regression suite, and add more debugging/tracing tools to the system. We would also like to continue regression runs on different versions of the current testbed system and try using the tool on a different testbed. P-Reg treats a pNFS system as a black box and hence may not be able to expose all performance issues in the system. We believe the amount of tracing P-Reg does lets it cover a large class of problems, but a better mechanism for tracing events in the system, one that looks deeper than the black box boundary, needs to be devised to provide more relevant debugging information.
8 Summary and Conclusion

This report describes P-Reg, a new performance regression suite and debugging tool for pNFS systems. P-Reg was designed and developed to:

• Provide summarized outputs to compare two different runs
• Provide information useful for debugging performance issues
• Archive data from previous runs
• Be easy to use
• Be extensible and modifiable
We believe P-Reg is a valuable aid for performance testing and debugging of pNFS systems.
References

[1] IOzone benchmark, http://www.iozone.org/
[2] Iperf network performance benchmark, http://dast.nlanr.net/Projects/Iperf/
[3] Hdparm, http://sourceforge.net/projects/hdparm/
[4] Tcpdump, http://www.tcpdump.org/
[5] Ethereal, http://www.ethereal.com/
[6] Tcptrace, http://tcptrace.com/
[7] Sysstat, http://perso.orange.fr/sebastien.godard/documentation.html
[8] Strace, http://sourceforge.net/projects/strace/
[9] Distributed shell, http://www.netfort.gr.jp/~dancer/software/dsh.html.en
[10] Gnuplot, http://www.gnuplot.info/
[11] PVFS2, http://www.pvfs.org/documentation.html
[12] Garth Gibson, Peter Corbett. pNFS Problem Statement. Internet Draft, July 2004. http://bgp.potaroo.net/ietf/idref/draft-gibson-pnfs-problem-statement/
[13] G. Gibson, B. Welch, G. Goodson, P. Corbett. Parallel NFS Requirements and Design Considerations. Internet Draft, October 18, 2004. http://bgp.potaroo.net/ietf/idref/draft-gibson-pnfs-reqs/
[14] D. Garlan and D. Perry. Guest editorial, Special Issue on Software Architecture. IEEE Transactions on Software Engineering, April 1995.
[15] Dean Hildebrand, Lee Ward and Peter Honeyman. Large Files, Small Writes. CITI Technical Report 04-2006.
[16] David F. Nagle, Gregory R. Ganger, Jeff Butler, Garth Goodson, and Chris Sabol. Network Support for Network-Attached Storage. Proceedings of Hot Interconnects 1999, August 18-20, 1999, Stanford University, Stanford, California, USA.