Optimal Scalable Software Architecture for Symmetric Multi-Core Embedded System

Kumar Bipin, Sachin Verma, Gaurav Singh and Imran Khan
Computing Platform and Tools, STMicroelectronics, Greater Noida, India
{kumar.bipin, sachin.verma, gaurav-mmc.singh, imran.khan}@st.com

Abstract—The demand for high computational bandwidth within the constraint of low energy consumption has driven the adoption of SMP cores in the embedded world. System designers attempting to achieve linear scalability of software execution on multi-core processors need to consider the hardware memory model (memory access patterns, interconnect bandwidth, cache levels, snooping, etc.), the SMP operating system (Linux kernel scalability: atomic operations, locking, per-CPU data, Read-Copy-Update, non-blocking synchronization, etc.) and application scalability (programming language memory model, thread-aware compilers, etc.). This paper attempts to identify bottlenecks at each layer and presents case studies used to design an optimal scalable solution for an embedded SMP system.

Keywords—SMP, Memory Model, Cache, Compiler, Data Consistency

I. INTRODUCTION

Modern embedded systems must have high computational capability to support a wide range of applications. At the same time, such systems must dissipate as little power as possible in order to remain battery powered. As parallel computing promises more computations per unit of energy dissipated, the embedded system industry is adopting symmetric multi-processing (SMP) for future solutions. ARM and its strategic partners have introduced a symmetric multi-core processor called Cortex-A9 with wide application in consumer electronics, home entertainment, automotive and medical fields. A key research challenge is to extract the maximum performance offered by multi-core embedded systems. Current SMP solutions fall short of the theoretical maximum utilization of multi-core resources, primarily due to the constraints of parallel computing implementations. While sharing of read-mostly data in parallel multi-threaded programs theoretically improves performance and lowers total memory utilization, conflicting accesses to a shared memory location by different cores require serialization with explicit synchronization semantics, which introduces computational overheads. Programmers maintain sequential consistency for variables and prevent data races on shared variables by the use of synchronization primitives.

A. Sequential Consistency
The execution of a multithreaded program consists of arbitrarily selecting a thread and performing its next instruction in sequence, so that the executions of multiple threads interleave with each other. The last value stored at any object or memory location is returned when any thread accesses it. The concept of sequential consistency can be explained using Dekker's mutual exclusion algorithm shown in Figure 1. It has interleaved instruction steps executed by two threads where the shared variables X and Y return the last stored value when they are read by any thread. Figure 2 shows all possible outcomes for the non-shared variables r1 and r2. Many other interleaved executions are possible, but none of them may output both r1 and r2 as zero. Such interleaved execution, which starts with the first statement of either thread, is sequentially consistent: in all possible outcomes r1 and r2 are never both zero. Both compiler and hardware optimizations (compiler reordering of instructions and hardware store buffers with read-ahead) may result in r1=0 and r2=0 as a possible outcome, and this is sequentially inconsistent behaviour.

Figure 1. Dekker’s Algorithm

Figure 2. Execution Sequences
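The following minimal C++ sketch reconstructs the two-thread fragment discussed above (Figure 1 itself is not reproduced in this text, so the exact code is an assumption); X, Y, r1 and r2 are the variables named in the text, kept as ordinary non-atomic globals precisely to expose the reordering problem.

#include <thread>
#include <cstdio>

int X = 0, Y = 0;   // shared flags, deliberately not atomic
int r1 = 0, r2 = 0; // per-thread results

void thread1() { X = 1; r1 = Y; }  // store X, then load Y
void thread2() { Y = 1; r2 = X; }  // store Y, then load X

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // Under sequential consistency at least one of r1, r2 must be 1.
    // With compiler reordering or hardware store buffers, r1 == 0 && r2 == 0
    // can be observed, which is the sequentially inconsistent outcome.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}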

Sequentially inconsistent outcomes occur because the compiler and hardware are unaware of variable sharing between threads. Most hardware and compilers today are not thread aware and cannot provide sequential consistency in all scenarios.

B. Data Race Free
Another violation of sequential consistency arises from simultaneous access to a single object or memory location by two competing threads. For example, consider a write or a read-write of a 64-bit variable Z on a 32-bit machine by two interleaving threads. Such conflicting parallel execution without strict ordering results in non-sequential behaviour that is described as a data-race condition. Programming languages generally provide synchronization mechanisms that enforce explicit ordering, such as locks, to limit simultaneous access to a variable by different threads. In addition to locks, modern programming languages provide mechanisms such as Java's volatile variables or the atomic variables of C++11 [8]. Such shared synchronization variables rely on the concept of a "system memory model" to correctly describe their implementation stack. It is impossible to use synchronization variables correctly without understanding what a memory model represents.

C. Memory Model
Different layers translate, transform and optimize a program written in a high-level language before it executes on hardware. The "memory model" defines the semantics that each transforming layer must obey so that memory accesses are consistent and there are no unsafe optimizations that compromise data integrity. Memory model related decisions depend on various layers of the system: programming languages, compilers, processors and memory interconnects must work in a cohesive manner to ensure high performance while maintaining correctness. In our research we focus on the memory model of ARM Cortex A9 based embedded systems. The ARM processor architecture supports a weak hardware memory model, and data integrity and performance issues may surface if software and tools do not adapt to the SMP challenge.
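As a minimal illustration of the data race described above and of a C++11 synchronization variable, the following sketch (iteration counts and names are illustrative, not taken from the paper) contrasts a plain 64-bit counter, whose concurrent updates are undefined behaviour and may be torn on a 32-bit core, with a std::atomic counter:

#include <atomic>
#include <cstdint>
#include <thread>
#include <cstdio>

int64_t              Z_plain = 0;   // racy: updates may be lost or torn on a 32-bit machine
std::atomic<int64_t> Z_atomic{0};   // C++11 synchronization variable: race free

void writer_plain()  { for (int i = 0; i < 100000; ++i) Z_plain = Z_plain + 1; }  // data race
void writer_atomic() { for (int i = 0; i < 100000; ++i) Z_atomic.fetch_add(1); }  // well defined

int main() {
    std::thread a(writer_plain), b(writer_plain);
    std::thread c(writer_atomic), d(writer_atomic);
    a.join(); b.join(); c.join(); d.join();
    // Z_plain may hold an arbitrary value; Z_atomic is always 200000.
    std::printf("plain=%lld atomic=%lld\n",
                (long long)Z_plain, (long long)Z_atomic.load());
    return 0;
}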

II. OVERVIEW

We have followed a bottom-up approach in discussing SMP scalability, going from hardware, to system software, to the application layer, using an experimental setup consisting of Linux running on a dual core ARM Cortex A9 based evaluation board. We first look at cache coherency as one of the key aspects affecting SMP system performance. The achieved performance of read/write operations issued from different cores on local caches demonstrates how snooping between local CPU caches compensates for the slow accesses to outer caches and ultimately to external memory. Subsequently, we demonstrate how data structures or algorithms that do not take the design of local cache coherence into consideration fall into the trap of cache ping-pong and false sharing, and we discuss some solutions to this problem.

Our experiments illustrate the overhead of kernel synchronization primitives (atomic, spinlock, mutex), which prevents applications from achieving linear scalability. Later, we show how multi-threaded applications written with the new C++ language standard (which has a thread-aware memory model) can scale far better than solutions written with traditional C++.

III. HARDWARE MEMORY MODEL

The hardware memory model defines how the hardware may make writes visible to different threads at different times through write buffers or shared caches (L1 and L2). Most hardware provides a relaxed memory model that is not strong enough to support sequential consistency completely. To bridge this gap, such models additionally provide "fence instructions" as an explicit mechanism for software designers to enforce sequential consistency. Hardware memory model designs are mostly driven by hardware optimizations (for example, branch prediction and instruction reordering based upon dependency graphs) that achieve optimal performance. The x86 documentation from AMD and Intel is ambiguous on the memory model issue: it expresses the intent but remains informal and is misinterpreted even by some industry experts. A similar trend is followed by the ARM specification, in which we have much interest. ARM has a weak hardware memory model. The presence of local caches along with different memory hierarchies makes software performance highly dependent on the hardware ecosystem. It thus becomes imperative to analyse how the hardware memory model will impact software performance.

A. Cache Coherency on SMP Embedded Systems
ARM provides certain optimizations to the basic MESI/MOESI cache coherency protocol, such as direct data intervention (DDI) and cache-to-cache (C2C) migration. Analysing the exact behaviour of these hardware implementations can help us accrue performance benefits by attuning our software to the hardware cache coherency mechanism. In other words, we need to measure the expense of the cache coherency mechanisms and also observe how different orderings of I/O operations (Read, Write, and Read-Write) could lead to better performance. To perform such an analysis we used a cache-to-cache performance benchmark tool called C2CBench, an open source tool for evaluating memory latencies at different hierarchies. C2CBench is based on a simple algorithm that uses the producer-consumer execution model, a logical representation of which is shown in Figure 3. The producer thread works on one core and reads its input from memory into the local cache. The producer may modify the input or produce output in its local cache. After this, the consumer thread on the second core reads the producer's data, which may involve transferring it to the consumer's local cache via the coherence mechanism. The producer-consumer interactions are synchronized to guarantee correctness. Details of C2CBench can be found in [13].

Figure 3. Logical Execution Model for C2CBench

We made use of the C2CBench tool in the following manner:

• Two threads, 'producer' and 'consumer', run on a dual core ARM Cortex A9 SMP system with the CPU affinity of each thread tied to a different core.

• Each thread, according to its role (producer or consumer), executes the corresponding algorithm.

• Both threads run simultaneously, with the producer running ahead of the consumer by exactly one block (called lockstep execution in C2CBench terminology).

Our experimental setup comprised an ARM Cortex A9 dual core SoC, each core having 32KB of L1 cache (data cache and instruction cache), along with a unified L2 cache of 512KB. We restricted the data set to 512KB so that only cache behaviour is recorded and trips to external memory are avoided. The producer/consumer algorithms were run with varying block sizes (from 8KB up to 512KB) and data strides (1, 4 and 8). Stride=1 implies all words are accessed consecutively; Stride=4 implies every 4th word is accessed. The throughput of the I/O operation (Read, Write, Read-Write) performed was recorded. The producer thread accesses a block of data first (either read or write), so that its local cache is populated first. The consumer acts only after the producer has accessed the data, and it can either snoop the data from the producer's cache or fetch it from the outer cache (L2).

In the first experiment we executed both producer and consumer threads with the Write operation (see Figure 4; the y axis represents bandwidth, the x axis represents block size). We observe that consecutive writes (Stride=1) lead to very high throughput, which can be attributed to the store buffers present in the ARM Cortex A9 architecture. This benefit of the store buffers quickly dies down when we perform scattered memory accesses (Stride=4 and Stride=8) and throughput drops quite significantly. Another thing to notice is the snooping benefit available to the consumer thread, and this drops off as soon as the block size goes beyond 32KB, which is the size of the local data caches (cache line evictions hampering the snooping benefit). Throughput at Stride=4 is higher than at Stride=8, which is expected because with Stride=4 we still have the benefit of accessing two data words from the same cache line, a benefit that is lost at Stride=8.

In the second experiment, we modified the consumer thread to perform read operations, retaining the producer thread for the write operation. Figure 5 illustrates the results for this use case. With Stride=1, as in the previous experiment, store buffer merging gives high throughput to the producer but the consumer throughput is very low. As we increase the stride to 4, producer throughput drops significantly, losing the store buffer benefits; the consumer, on the other hand, does get the snooping benefit. The result for Stride=8 is very much in line with the results of the first experiment.
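The following is a minimal sketch of how such a pinned, strided producer/consumer measurement can be set up; it is not the C2CBench source, and the block size, stride and core numbers are illustrative only. pthread_setaffinity_np is a GNU extension (g++ on Linux defines _GNU_SOURCE by default).

#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <cstdio>

constexpr std::size_t kBlockWords = 8 * 1024 / sizeof(unsigned);  // one 8KB block
static unsigned block[kBlockWords];

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);    // tie thread to one core
}

static void* producer(void*) {          // writes every 4th word (Stride=4)
    pin_to_cpu(0);
    for (std::size_t i = 0; i < kBlockWords; i += 4) block[i] = (unsigned)i;
    return nullptr;
}

static void* consumer(void*) {          // reads the same words, possibly snooping core 0's cache
    pin_to_cpu(1);
    unsigned sum = 0;
    for (std::size_t i = 0; i < kBlockWords; i += 4) sum += block[i];
    std::printf("sum=%u\n", sum);
    return nullptr;
}

int main() {
    pthread_t p, c;
    pthread_create(&p, nullptr, producer, nullptr);
    pthread_join(p, nullptr);           // the real tool runs the threads in lockstep; serialized here for brevity
    pthread_create(&c, nullptr, consumer, nullptr);
    pthread_join(c, nullptr);
    return 0;
}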

Figure 4. Producer doing write, consumer doing write

Figure 5. Producer doing Write, consumer doing Read

Figure 6. Producer doing Read, consumer doing write

In the third experiment, we swapped the I/O operations performed by the two threads, i.e. the producer performing the read and the consumer doing the write operation (Figure 6). With Stride=1 the consumer had a high throughput, as it performs the write operation and the producer lags behind. Stride=4 and Stride=8 are very similar to the results of the previous experiments, with the consumer getting the benefit of snooping and the cache line benefit vanishing at Stride=8. Our final experiment with this setup had both producer and consumer threads doing reads (Figure 7). This time the results differ a little from the earlier experiments, with the consumer getting only the snooping and cache line benefits (which eventually drop as the stride increases).

These experiments with C2CBench provide some very useful insights into Cortex A9 cache behaviour. We observe that accessing physically contiguous memory addresses is typically four times faster than accessing elements in a sparse manner (stride of 4). If the stride increases to 8, contiguous memory access becomes 6 times faster. Writes are much faster than reads as long as the store buffer is in the picture (typically of the order of 6.5 times). We can therefore infer that, on a Cortex A9 SMP system, keeping write accesses in our program stream contiguous (without loss of functionality) could give a throughput increase of the order of 4 times in certain situations. Another observation is that for sparse access (Stride=8), reads have higher throughput than writes, of the order of 1.5 times.

B. Cache Ping-Pong
Another related problem seen on shared memory systems with local caches is false line sharing, where multiple cores vie for access to the same cache line and at least one of those accesses is a write.

Figure 7. Producer doing Read, consumer doing Read

False line sharing can create a performance bottleneck for an otherwise well-written parallel program. Hardware vendors, aware of the issue, try to alleviate it with a few optimizations, but some overheads remain, as illustrated in the experiment below. Each thread updates a counter value at a specified index (unique to each thread) in the memory locations pointed to by the array data. The total counting to be done by the program is 2*NUM_ITERATIONS. When only a single thread runs, it updates the counter at data[0] for 2*NUM_ITERATIONS. When two threads run (tied to different cores by setting CPU affinity), each updates a counter for NUM_ITERATIONS, at data[0] and data[1] respectively, effectively parallelizing the work. By the logic of parallel programming, since we split the work between two threads that are independent, make no system calls and have no contention for a shared resource, we expect the scaling factor to be 2. But as Figure 8 shows, the scaling factor is only 1.65. The problem lies in the way locations data[0] and data[1] are placed in memory and how the local caches use memory. Caches deal with memory in cache lines, and each cache line is 32, 64 or 128 bytes depending on the architecture. Contiguous memory locations end up on the same cache line. So whenever thread0 writes to data[0], the corresponding cache line is read into the local data cache of thread0's core. This cache line also contains the data location of thread1, so thread1 has to wait until thread0 has completed its work, and this ping-pong repeats once thread0 is scheduled again.

Solving such a cache ping-pong problem is non-trivial. One needs to be careful while creating data structures and make sure that data members written to by different threads lie in different cache lines. ARM and other hardware vendors have tried to optimize this by implementing snooping mechanisms, but these have their own overheads. We removed the problem in software by padding the "data" structure such that data[0] and data[1] lie in different cache lines. When we reran the program with padded structures, performance with 2 threads scaled to almost 2, as shown in Figure 8 (which is the ideal result).
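The following sketch shows the padding fix in the spirit of the experiment above; the cache line size, iteration count and structure name are illustrative assumptions, the point being only that data[0] and data[1] must land on different cache lines.

#include <pthread.h>
#include <cstdio>

constexpr int  CACHE_LINE_SIZE = 64;         // 32, 64 or 128 bytes depending on the architecture
constexpr long NUM_ITERATIONS  = 50000000L;

struct PaddedCounter {
    alignas(CACHE_LINE_SIZE) volatile long value;  // each counter owns a full cache line
};
static PaddedCounter data[2];                // without alignas, data[0] and data[1] share a line

static void* worker(void* arg) {
    long idx = (long)arg;                    // index is unique to each thread
    for (long i = 0; i < NUM_ITERATIONS; ++i)
        data[idx].value++;                   // no line sharing, so no cache ping-pong
    return nullptr;
}

int main() {
    pthread_t t0, t1;                        // in the experiment each thread is also pinned to its own core
    pthread_create(&t0, nullptr, worker, (void*)0);
    pthread_create(&t1, nullptr, worker, (void*)1);
    pthread_join(t0, nullptr);
    pthread_join(t1, nullptr);
    std::printf("%ld %ld\n", (long)data[0].value, (long)data[1].value);
    return 0;
}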

Figure 8. Cache False sharing effect on Scaling factor and a proposed solution

IV. SYSTEM SOFTWARE: LINUX OS

Support for SMP was introduced in the 2.0 Linux kernel; since then the granularity of locking has been gradually refined towards finer locking of critical sections. In the 2.4 kernel and subsequently in the 2.6 kernel, many of the global locks were broken up to improve kernel scalability. Another scalability improvement was the use of reference counting to protect kernel objects, which also helps to avoid long critical sections. While such features benefit large SMP and NUMA systems, on an embedded system, which is typically quite small, the benefit from reduced lock contention is minimal. We wanted to measure the overhead associated with some of the basic low-level locking primitives provided by the Linux kernel: atomic variables, mutexes and spinlocks. These locking primitives are also the origin of several other locking techniques; for example, RCU and futexes [5, 6, 7] are derived from atomics, and several read-write locks are based upon mutexes and spinlocks.
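As an illustration only, the following user-space sketch shows one way the per-operation cost of an atomic increment, a mutex-protected increment and a spinlock-protected increment could be compared; it is not the benchmark behind Table I, which exercises the kernel primitives themselves.

#include <pthread.h>
#include <atomic>
#include <cstdio>
#include <ctime>

static long               counter = 0;
static std::atomic<long>  atomic_counter{0};
static pthread_mutex_t    mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_spinlock_t spin;

static void atomic_op() { atomic_counter.fetch_add(1); }
static void mutex_op()  { pthread_mutex_lock(&mtx);  counter++; pthread_mutex_unlock(&mtx); }
static void spin_op()   { pthread_spin_lock(&spin);  counter++; pthread_spin_unlock(&spin); }

static double seconds(void (*op)(), long iters) {
    timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < iters; ++i) op();
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main() {
    const long iters = 10000000L;
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
    std::printf("atomic  : %f s\n", seconds(atomic_op, iters));
    std::printf("mutex   : %f s\n", seconds(mutex_op,  iters));
    std::printf("spinlock: %f s\n", seconds(spin_op,   iters));
    return 0;
}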

Figure 9. Per-CPU Data experiment

TABLE I. COST OF KERNEL SYNCHRONIZATION PRIMITIVES

Figure 10. Lock based approach

Table I shows the cost of synchronization primitives on an ARM Cortex-A9 based evaluation board running Linux. The cost of an atomic operation is much closer to that of a mutex, and the cost of a spinlock, supposed to be a fine-grained lock on SMP, is much higher than the other two. In the 2.6 kernel series very little code is left under the global kernel lock (BKL). Most code locking in common paths has been converted into data locking, and there has been a lot of scalability tuning to eliminate shared cache line bouncing [8]. The 2.6 kernel brought new primitives such as RCU and per-CPU data, lock-free algorithms for the route cache and the directory entry cache, as well as scalable user-level APIs such as futexes [5, 6, 7]. Linux kernel 2.6 introduced per-CPU variables, which can be used consistently as almost lock-free. With a per-CPU variable, every CPU gets its own copy of the variable, so there is no need to protect these variables from concurrent access by different CPUs as long as each CPU works with its own copy. In our next experiment we show how a counter implemented with per-CPU data achieves better linear scalability than a similar counter implemented with a locking primitive. In the code snippet shown in Figure 9, each thread runs on a different CPU and modifies its own copy of count without using any lock. Whenever the total count is needed, the per-CPU instances of count are added together. In the lock-based approach shown in Figure 10, each thread needs to acquire a lock before modifying the shared variable count. Even if both threads take the atomic approach, at higher load we get improved performance using the per-CPU variable, as shown in Table II.
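Since the kernel code of Figures 9 and 10 is not reproduced here, the following user-space analogue sketches the same idea under illustrative names and iteration counts: one padded counter slot per "CPU" (thread) that is summed only when the total is needed, versus a single shared counter serialized by a mutex.

#include <pthread.h>
#include <cstdio>

constexpr long ITERS = 10000000L;

// Per-CPU style: each thread owns one padded slot, so no lock is needed.
struct alignas(64) Slot { volatile long count; };
static Slot per_cpu_count[2];

static void* percpu_worker(void* arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; ++i) per_cpu_count[id].count++;
    return nullptr;
}

// Lock-based style: both threads serialize on one mutex around a shared count.
static long            shared_count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void* locked_worker(void*) {
    for (long i = 0; i < ITERS; ++i) {
        pthread_mutex_lock(&lock);
        shared_count++;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main() {
    pthread_t t[2];
    for (long i = 0; i < 2; ++i) pthread_create(&t[i], nullptr, percpu_worker, (void*)i);
    for (int  i = 0; i < 2; ++i) pthread_join(t[i], nullptr);
    long total = per_cpu_count[0].count + per_cpu_count[1].count;  // summed only when the total is needed

    for (int i = 0; i < 2; ++i) pthread_create(&t[i], nullptr, locked_worker, nullptr);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], nullptr);

    std::printf("per-cpu total=%ld, locked total=%ld\n", total, shared_count);
    return 0;
}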

Through this analysis we have identified a few areas of scalability enhancement in the Linux kernel, allowing Linux to run better on small SMP systems.

TABLE II. PER-CPU DATA VS. LOCKING PRIMITIVES

V. PROGRAMMING LANGUAGE

The original design of native languages such as C and C++ does not support multi-threading. Consequently, system developers use thread libraries (e.g. POSIX threads) to parallelize programs written in native languages. To avoid data races and maintain sequential consistency, programmers "synchronize" threads using the library's locking primitives. External thread libraries and the associated locks are generally considered sufficient for writing multi-threaded programs. However, there are two major drawbacks in this widely used approach:

1. Program Correctness Ambiguity

Thread library standards impose memory barriers and rules for compiler re-ordering in an attempt to maintain data-race freedom and sequential consistency. However, in spite of such rules, undefined and unexpected program behaviour has been observed in code written using thread libraries. Boehm [1] presents three cases where possible compiler and hardware transformations of seemingly correct code written with POSIX libraries led to violation of the data-race-free condition. Therefore, programmers using Pthread based implementations of languages like C and C++ cannot be certain about the correctness of their programs. Code that is well tested and proven can exhibit undefined behaviour once the compiler version or the underlying hardware changes.

2. Performance Hit Due to Lock-Based Synchronization

Locking overheads have a significant effect when locks are used to synchronize threads in frequently used common libraries. Locking can also lead to dreaded deadlock situations where one malicious thread can block execution of an entire application. To boost application performance, embedded system programmers must have the option of writing lock-free code. The ad-hoc use of Pthreads to support concurrency in native languages is looking unsustainable. Instead, multi-threading support must be added at the language level with a well-defined "shared memory consistency model" (or memory model) that defines the behaviour of shared variables.

A. Programming Language Memory Model
Programming language memory models describe how threads interact through memory and define how assignments to a shared variable in one thread will be seen by concurrently executing threads. Additionally, memory models define where compilers can transform programmer directives in a multi-threaded environment. Programmers using languages with well-defined memory models can write correct programs (that are sequentially consistent and data-race-free) simply by using language constructs to disallow certain optimizations and reordering by the compiler. At the same time, they can code for performance by explicitly telling the compiler where to relax synchronization rules. Realizing the importance of a memory model for multi-threaded environments, Java introduced a well-specified model in 2004. However, since performance-critical code in embedded systems1 (and many other systems for that matter) is still written in native languages, the need for a C and C++ memory model was felt acutely when the multi-core revolution gathered momentum.

B. C++11 and Its Memory Model
In 2011 a new C++ specification, called C++11, was released; among the new features were support for threading, mutex-based locking and a memory model for multi-threaded C++ programs. Compilers written to the C++11 specification must consider thread interaction before reordering variable accesses. A full list of the threading and mutex support in C++ can be found in [8]. The language specification also defines atomic operations, to be used by expert system programmers to write high-performance code. It is further claimed that fast data-race-free and lock-free algorithms can be written using atomics. We put this claim to the test and also show why the compiler and hardware must fully support C++11 atomic operations for the claim to hold true. Consider the well-known algorithm for finding prime numbers, the Sieve of Eratosthenes (used by Boehm in [1] to show the expense of synchronization). Figure 11 shows a modified C++ implementation of the core loop function.

Figure 11. Sieve of Eratosthenes algorithm
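Since Figure 11 itself is not reproduced in this text, the following is a hedged reconstruction of such a core loop; the array name sieve[] follows the text, while the bounds and the start/step partitioning between the two threads are assumptions. Declaring the elements std::atomic<bool> gives the lock-free C++11 variant whose disassembly is discussed below.

#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t MAX = 1000000;
static std::atomic<bool> sieve[MAX];      // static storage: starts out all false (possibly prime)

// Both threads execute this same loop; the default (sequentially consistent)
// atomic accesses are what make the compiler emit "dmb" barriers on ARM.
void sieve_loop(std::size_t start, std::size_t step) {
    for (std::size_t p = start; p * p < MAX; p += step) {
        if (!sieve[p]) {                  // atomic load
            for (std::size_t m = p * p; m < MAX; m += p)
                sieve[m] = true;          // atomic store: marking a composite twice is harmless
        }
    }
}

int main() {
    std::thread t1(sieve_loop, 2, 2), t2(sieve_loop, 3, 2);  // the two threads split the candidates
    t1.join();
    t2.join();
    return 0;
}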

The above algorithm can be executed in parallel on an SMP system by using two threads executing the same loop. However, such an implementation is "unsafe" due to a data race associated with the sieve[] accesses. To prevent data races and ensure application portability across hardware architectures and compilers, concurrent reads and writes to sieve[] must be synchronized. From the execution time results shown in Figure 12, we observe that writing lock-free code using atomics gives significant performance benefits over using Pthread mutexes and spinlocks. Figure 13 shows the disassembly of the C++11 atomics based code for ARM Cortex A9 with a Boolean flag base type. The ARM memory barrier instruction "dmb" disallows unsafe compiler and hardware optimizations, thus maintaining thread safety. The disassembly of the Pthread mutex based code (Figure 15) reveals that a time-consuming system call is used to achieve thread synchronization. Pthread spinlocks are implemented using a loop-based double-locking method that introduces unnecessary delays.

1 Recent tests by Google confirmed that C++ outperforms Java in a well-specified runtime performance benchmark [11].

Compiler support for C++11 atomics is crucial. We found that the latest freely available GCC version for the ARM architecture (GCC 4.6.2) does not provide a true atomic qualifier implementation for all base variable types (char, int and float); instead the compiler inserts a simple placeholder (Figure 14). In contrast, the x86 version of the same compiler (GCC 4.6.2) has a full implementation of the atomic qualifier for all base variable types.

Figure 12. Sieve Execution time ARM Cortex A9 (secs).

Figure 14. C++11 Atomics Code Disassembly (“int” Base type).

Figure 13. C++11 Atomics Code Disassembly (“Boolean flag” base type).

Figure 15. Pthread Mutex Code Disassembly.

C. C++11 Atomics Memory Ordering
By default, atomic variables follow sequentially consistent ordering rules, and all threads (running on different cores) will see the same order of changes to shared variables during program execution. In practice, this means that when a shared atomic variable is changed, all cores are synchronized using barrier instructions. Figure 16 shows sequentially consistent operations on atomics x and y.

Figure 16. Four threads synchronizing using C++11’s Sequential consistency rules.
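A hedged sketch in the spirit of Figure 16 (the exact figure code is not reproduced here) is shown below: four threads operate on atomics x and y with the default sequentially consistent ordering, so all cores observe the two stores in one global order.

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void t1() { x.store(1); }                   // sequentially consistent store
void t2() { y.store(1); }                   // sequentially consistent store
void t3() { r1 = x.load(); r2 = y.load(); } // observer 1
void t4() { r3 = y.load(); r4 = x.load(); } // observer 2

int main() {
    std::thread a(t1), b(t2), c(t3), d(t4);
    a.join(); b.join(); c.join(); d.join();
    // With sequential consistency the combination r1==1, r2==0, r3==1, r4==0 is
    // impossible: the two observers cannot see the stores in opposite orders.
    return 0;
}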

Such system-wide synchronization is expensive. C++11 provides the option to pick and choose which cores to synchronize. Consider a situation where we have four threads (on four different cores) and there is a need to synchronize shared variables for only two of them. Such a situation may be encountered when an application contains two graphics threads along with one audio thread and one UI thread. The C++11 memory_order_acquire and memory_order_release ordering primitives can be used to perform pair-wise synchronization between the graphics threads, reducing the overhead. However, the underlying hardware must provide fine-grained synchronization instructions to the compiler writer.
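The following sketch illustrates such pair-wise synchronization under illustrative names: only the two "graphics" threads synchronize through a flag, using acquire/release ordering instead of the heavier sequentially consistent default, while the audio and UI threads of the example would run undisturbed.

#include <atomic>
#include <thread>
#include <cassert>

std::atomic<bool> frame_ready{false};
int frame_data = 0;                                        // plain data handed between the two threads

void graphics_producer() {
    frame_data = 42;                                       // prepare the frame
    frame_ready.store(true, std::memory_order_release);    // publish; no system-wide barrier required
}

void graphics_consumer() {
    while (!frame_ready.load(std::memory_order_acquire))   // pairs with the release store
        ;                                                  // spin until published
    assert(frame_data == 42);                              // guaranteed visible after the acquire
}

int main() {
    std::thread p(graphics_producer), c(graphics_consumer);
    p.join();
    c.join();
    return 0;
}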

VI. RELATED WORK

There is very little published work in the area of symmetric multiprocessing embedded system performance scalability; most of the research focuses on desktop and server systems. Consider the work of Dipankar Sarma et al. [10], which discusses scalability features of the 2.6 Linux kernel. The researchers show the overhead due to locks and explain the use of RCU on desktop Linux. However, they do not consider the embedded Linux case. In our paper we look at eliminating certain locks that might not be useful on embedded SMP systems. In Section V on programming languages, we have significantly extended the work on thread library issues in native languages by Boehm [1] and the C++11 concurrency model by Boehm et al. [2]. We implemented the theoretical concepts of the native language memory model presented in these papers on embedded ARM Cortex A9 systems. Our experimental results using atomics show that the C++11 memory model cannot succeed without a compatible hardware memory model and compiler design. We go on to discuss the future significance of C++11's memory model features, presenting a possible application case that compiler and hardware designers should cater to. Prior work on cache coherency and hardware memory model effects again focuses on desktop and server systems. Daniel Molka et al. [4] describe cache coherency effects on Intel processors; however, the benchmark used in that paper is not freely available. Our work on cache coherency effects in Section III is the first time such work has been done for ARM Cortex A9 based systems. The discussion of the hardware memory model implemented by ARM provides valuable insights into how SMP embedded systems should be programmed. Moreover, we have used the open-source C2CBench cache-to-cache performance benchmark for our performance analysis. Teng-Feng Yang et al. [3] have performed cache scheduling experiments using a specific Intel library. We instead use a simple generic test to show how data use patterns affect the performance of software libraries. We also discuss false sharing and its solutions.

By viewing performance bottlenecks at every layer of system design, we are one of the very few research groups that have thoroughly analysed the scalability of system software and hardware as well as application software. Nandan Tripathi et al. also analyse bottlenecks at different layers in [12]. However, this recent work relies heavily upon the opaque and costly Multibench benchmark programs and esoteric tracing techniques for analysing bottlenecks on embedded systems.

VII. CONCLUSION
We believe that there should be harmony and transparency between the designs of hardware, system software and programming language memory models in order to achieve linear scalability. We show the performance benefits of writing lock-free code using C++11 atomics on an embedded ARM-based SMP system. Further, we foresee new hardware barrier instructions that give compiler writers the proper tools to optimize C++11 code. Although the Linux kernel has been, and is still being, optimized from time to time, we need to take into account the modifications that impact scalability on small SMP systems in order to achieve theoretical maximum utilization of the available processing capability. On the hardware end, we have experimented with the memory hierarchy of an ARM Cortex A9 based system,

highlighting the optimizations made by the hardware to deal efficiently with shared variables.

ACKNOWLEDGEMENT
Special thanks are due to Rohit Dhawan for his technical input on Section III (Hardware Memory Model). We would also like to thank Gian Antonio Sampietro and Luca Furiato for their kind patience.

REFERENCES
[1] Hans-J. Boehm, "Threads cannot be implemented as a library," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '05), ACM, New York, NY, USA, 2005, pp. 261-268. DOI: 10.1145/1065010.1065042
[2] Hans-J. Boehm and Sarita V. Adve, "Foundations of the C++ concurrency memory model," in Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), ACM, New York, NY, USA, 2008, pp. 68-78. DOI: 10.1145/1375581.1375591
[3] Teng-Feng Yang, Chung-Hsiang Lin and Chia-Lin Yang, "Cache-aware task scheduling on multi-core architecture," in 2010 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 139-142, April 2010. DOI: 10.1109/VDAT.2010.5496710
[4] D. Molka, D. Hackenberg, R. Schone and M. S. Muller, "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system," in 18th International Conference on Parallel Architectures and Compilation Techniques (PACT '09), pp. 261-270, Sept. 2009. DOI: 10.1109/PACT.2009.22
[5] Krste Asanovic, Rastislav Bodik, James Demmel, et al., "A view of the parallel computing landscape," Communications of the ACM, vol. 52, no. 10, pp. 56-67, October 2009. DOI: 10.1145/1562764.1562783
[6] Robert Love, "Kernel Korner: Kernel locking techniques," Linux Journal, Issue 100, August 2002.
[7] P. E. McKenney, J. Appavoo, A. Kleen, O. Krieger, R. Russell, D. Sarma and M. Soni, "Read-Copy Update," Ottawa Linux Symposium, July 2001.
[8] Anthony Williams, C++ Concurrency in Action: Practical Multithreading, Manning Publications, 1st edition, January 2012.
[9] Bruce Jacob, Spencer Ng and David Wang, Memory Systems: Cache, DRAM, Disk, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[10] http://ols.fedoraproject.org/OLS/Reprints-2004/Reprint-SarmaOLS2004.pdf
[11] https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
[12] http://www.design-reuse.com/articles/26126/analyzing-multithreadedapplications-multicore.html
[13] http://sourceforge.net/projects/c2cbench/
