Over the last few decades, we have witnessed a spectacular evolution in processor design. Whereas 20 years ago, architects were building single-core processors with 100 million transistors at clock frequencies around 1.5 GHz, contemporary state-of-the-art processors can reach clock frequencies exceeding 4.5 GHz in a so-called turbo boost mode, and feature more than 24 cores, with multiple threads running simultaneously on the same physical core.
Architects have used the exponentially increasing transistor counts afforded by Moore's Law to speed up single-threaded performance through various enhancements, including speculation, out-of-order execution, larger caches, etc. Nowadays, superscalar out-of-order processing is the de facto standard, and improving single-threaded performance is becoming harder. Therefore, architects are using the available transistor budget to scale up the number of cores in a processor. With more cores, software can process data in parallel, thereby significantly increasing performance.
But speculation, deeper and wider pipelines, out-of-order processing of instructions, etc. not only make processors faster, they also make them more complex.
Due to this increased complexity, the need for fast and accurate tools to evaluate new processor designs is very high. The preferred tool for processor design has, until now, been detailed simulation. Cycle-accurate simulation offers a very high level of accuracy, so it may appear to be the perfect tool for making important design decisions. However, because these simulators implement the design in great detail, they are also very time-consuming, which leads to impractical slowdowns when simulating long-running applications.
To simulate long-running applications, architects can no longer rely on detailed simulators and hence need to use functional and/or sampled simulation.
Unfortunately, even these simulation techniques become unfeasible when systems become too large or applications too long-running.
Analytical models are a good way to complement detailed timing simulations. These models predict execution time using a mathematical model whose inputs are derived from a simple functional simulation, while maintaining high prediction accuracy. The interval model is one example; it is based on the observation that, in the absence of miss events, the performance or instructions per cycle (IPC) of the processor pipeline equals the width of the pipeline. In other words, the processor processes instructions at its maximum IPC, i.e., its designed dispatch width, as long as there are no miss events. This behavior is best observed at the dispatch stage, where instructions leave the front-end of the pipeline (after fetching and decoding) and enter the back-end, where instructions are executed and all communication to memory is handled. The interval model captures the impact of branch mispredictions and instruction- and data-cache misses.
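As a rough sketch, not the authors' implementation, the interval model's cost accounting can be written as base cycles at the dispatch width plus a penalty term per miss-event type; all counts and penalties below are made-up illustrative numbers:

```python
def interval_model_cycles(num_insns, dispatch_width, miss_events):
    """Sketch of the interval model's cost accounting: base cycles at
    the dispatch width, plus a penalty per miss event. `miss_events`
    is a list of (count, penalty_cycles) pairs, one per event type."""
    base_cycles = num_insns / dispatch_width  # no miss events: IPC == width
    return base_cycles + sum(count * penalty for count, penalty in miss_events)

# Made-up numbers: 2,000 branch mispredictions at a 15-cycle refill,
# 1,500 long-latency cache misses at 200 cycles each.
print(interval_model_cycles(1_000_000, 4, [(2_000, 15), (1_500, 200)]))
# 580000.0
```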
The interval model provides a fast and accurate way to predict application performance, but with a growing design space, even the recurring cost of re-running the functional simulations for different configurations becomes a major concern. Our approach to this problem is a microarchitecture-independent profiling technique: the profile consists only of application characteristics that are independent of the microarchitecture the application runs on. After building the profile, microarchitecture-specific inputs are generated for a given hardware configuration, which enables the analytical model to predict performance. Constructing the profile is a one-time cost, making this approach well suited for the exploration of large processor design spaces.
The key challenge is thus to create microarchitecture-independent profiling techniques that provide inputs to the interval model. To model cache behavior, a tool called StatStack can be used, which uses reuse distance histograms to predict the cache miss rate of an application for any cache size. Some prior work has modeled the branch behavior of an application, e.g., taken rate and transition rate [9, 24], and branch entropy by Yokota et al. Unfortunately, this prior work cannot accurately predict the number of mispredicted branches for a specific branch predictor. Therefore, a new, more accurate model is needed.
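For the cache side, the idea behind distance-based miss-rate prediction can be sketched with exact LRU stack distances; note that StatStack itself estimates these statistically from sampled reuse distances rather than computing them exactly, so this is only an illustration of the underlying principle:

```python
def stack_distances(trace):
    """Exact LRU stack distances for an address trace. (StatStack
    estimates these statistically from sampled reuse distances;
    this exact version just illustrates the idea.)"""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            dists.append(len(stack) - 1 - i)  # distinct addrs since last use
            stack.pop(i)
        else:
            dists.append(float('inf'))        # cold miss: never seen before
        stack.append(addr)                    # move to most-recently-used slot
    return dists

def miss_rate(dists, cache_lines):
    """A fully-associative LRU cache of `cache_lines` lines misses
    exactly when the stack distance reaches its capacity."""
    return sum(d >= cache_lines for d in dists) / len(dists)

dists = stack_distances(['a', 'b', 'c', 'a', 'b', 'c'])
print(miss_rate(dists, 2), miss_rate(dists, 4))  # 1.0 0.5
```

Because the distances are computed once, the miss rate can then be evaluated for any cache size without re-running the trace, which is precisely what makes this style of profile microarchitecture-independent.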
We propose linear branch entropy, a novel metric that quantifies how regular the branch behavior of an application is. An entropy of 0 means the branch behavior follows a regular pattern; such branches are easy to predict, leading to a low number of mispredicted branches. An entropy of 1 implies a lot of randomness in the branch behavior; such branches are hard to predict, leading to a high number of mispredicted branches. These endpoints match the definition of Shannon entropy, but we replace the sum of logarithmic terms in the entropy formula with a simple linear function, and we demonstrate that our new approach correlates better with how a branch predictor actually works.
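To make the contrast concrete, here is a sketch in terms of a single branch's taken probability p. The specific linear function below, 2·min(p, 1−p), is an illustrative choice that preserves the 0 and 1 endpoints described above; it is not necessarily the exact formula used in the full text:

```python
import math

def shannon_entropy(p):
    """Shannon entropy of a branch with taken probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def linear_entropy(p):
    """A linear counterpart with the same endpoints: 0 for a fully
    biased branch, 1 for a 50/50 branch. 2*min(p, 1-p) is one such
    function (illustrative; see the text for the actual definition)."""
    return 2 * min(p, 1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p}: shannon={shannon_entropy(p):.3f} linear={linear_entropy(p):.3f}")
```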
Because different branch predictors have a different misprediction rate for the same application, each type of branch predictor has its own model. We find that a linear relationship between entropy and the measured branch misprediction rate is the best fit. We use the following equation to calculate the misprediction rate M for a given entropy E:
M(E) = α + β × E.
Parameters α and β are calculated during a training run and differ for each branch predictor. We validate this new metric by predicting the misprediction rate for five branch predictors, namely GAg, GAp, PAp, gshare and a tournament predictor, across 95 benchmarks from two benchmark suites, SPEC CPU 2006 and CBP 2011. Across all modeled predictors, the average absolute error is around 0.70 MPKI and 0.89 MPKI for CBP and SPEC, respectively. The average MPKI equals 10.8, which means that the model features a relative error of less than 10%.
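Fitting α and β for one predictor is plain ordinary least squares on (entropy, misprediction-rate) pairs from the training run; the training points below are hypothetical values chosen for illustration:

```python
def fit_linear(entropies, miss_rates):
    """Ordinary least squares for M(E) = alpha + beta * E, fitted
    once per branch predictor on training applications."""
    n = len(entropies)
    mean_e = sum(entropies) / n
    mean_m = sum(miss_rates) / n
    var_e = sum((e - mean_e) ** 2 for e in entropies)
    cov_em = sum((e - mean_e) * (m - mean_m)
                 for e, m in zip(entropies, miss_rates))
    beta = cov_em / var_e
    alpha = mean_m - beta * mean_e
    return alpha, beta

# Hypothetical (entropy, misprediction-rate) training points:
alpha, beta = fit_linear([0.1, 0.4, 0.8], [0.02, 0.08, 0.16])
print(round(alpha, 6), round(beta, 6))  # ~0.0 ~0.2
```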
The first use case of linear branch entropy is steering if-conversion. Linear branch entropy is an accurate metric for categorizing branches into hard-to-predict and easy-to-predict branches. Since this can be done on a per-branch basis, it is an effective tool to profile branches and steer if-conversion. The impact of branch behavior on the overall execution time of an application is determined by the number of mispredicted branches. When branches are very hard to predict, the branch predictor cannot avoid most mispredictions. Thus, the only way to decrease the branch impact is to avoid the misprediction penalty altogether, through if-conversion. By implementing this technique in the LLVM compiler, we improve performance by up to 2.4% compared to the default implementation.
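If-conversion replaces a control dependence by a data dependence. The real transform happens in the compiler back-end, via predication or conditional-move instructions; the following is only a source-level analogue of the same idea, using integer absolute value as a stand-in example:

```python
def abs_branchy(x):
    if x < 0:          # data-dependent branch: mispredicts on random signs
        return -x
    return x

def abs_ifconverted(x):
    mask = -(x < 0)           # 0 if x >= 0, -1 if x < 0 (no branch taken)
    return (x ^ mask) - mask  # two's-complement conditional negate

print([abs_ifconverted(x) for x in (-5, 0, 7)])  # [5, 0, 7]
```

The branchless version does slightly more arithmetic on every input, so it only pays off for hard-to-predict branches, which is exactly the per-branch decision linear branch entropy is meant to steer.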
The second use case of linear branch entropy is providing inputs to the interval model to model the branch penalty. During the profiling run of the application, linear branch entropy is calculated. During the modeling phase, the model for the target branch predictor, together with the entropy, yields a prediction of the misprediction rate. From this, we derive the number of mispredicted branches, which in turn is used to calculate the impact of all mispredicted branches. The use of linear branch entropy to model the branch penalty is part of the single-threaded model proposed by Van den Steen et al. In this model, the error of the branch component is only 1% compared to simulation, leading to a 0.16% error on the complete execution time, whereas Yokota's model, on average, leads to an underprediction of around 60%, or a 7% error when predicting overall execution time.
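Plugging the predicted misprediction rate into the interval model's branch component can be sketched as follows; the branch count, entropy, fitted parameters and refill penalty are all made-up values:

```python
def branch_penalty_cycles(num_branches, entropy, alpha, beta, refill_penalty):
    """Branch component of the interval model: the per-predictor
    linear model predicts the misprediction rate; each misprediction
    costs a front-end refill penalty (all numbers here are made up)."""
    miss_rate = alpha + beta * entropy          # M(E) = alpha + beta * E
    mispredictions = miss_rate * num_branches   # predicted misprediction count
    return mispredictions * refill_penalty      # total penalty cycles

print(branch_penalty_cycles(100_000, 0.3, 0.01, 0.2, 15))  # ~105000 cycles
```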
This microarchitecture-independent analytical model can predict the execution time of single-threaded applications, but to make the most of current multicore processor hardware, applications need to adapt. Programmers can no longer rely on the computer architect to improve performance.
Therefore, applications need to be parallelized to utilize the available hardware to the best possible extent. This adds yet another dimension and even more complexity to the design of both the hardware and the software. To enable software architects to adapt their applications to upcoming multicore hardware, and hardware architects to optimize this new hardware, they need new tools that enable fast and accurate design space exploration and that give them the necessary insight into possible performance bottlenecks.
To tackle these new challenges, we propose RPPM, a microarchitecture-independent modeling tool for rapid performance prediction of multi-threaded applications on multicore hardware. Multi-threaded applications create multiple worker threads that execute in parallel. By dividing the work among these worker threads, the application tries to speed up execution.
But this comes with a cost and significant added complexity. These threads can and will interfere with each other, both directly through synchronization and indirectly through memory.
The first major difference between single- and multi-threaded applications is the interference threads have upon each other through synchronization. The most commonly used synchronization primitives include critical sections, barriers and producer-consumer relationships, available through parallel execution libraries such as OpenMP and pthreads. Interference through memory can occur in many ways and can lead to either an increase or a decrease of the execution time. To model interference through memory, RPPM uses a new version of StatStack that is able to predict cache miss rates for multi-threaded applications.
To model multi-threaded execution behavior, we intercept all calls to the synchronization libraries during the profiling phase, and collect all characteristics needed to model per-thread behavior and multi-threaded memory behavior.
To predict the execution time of multi-threaded applications, we first predict the execution time of all individual threads, between synchronization events.
This is done by extending the single-threaded model with multi-threaded StatStack. The next step is to predict the overhead caused by synchronization, which is done by reintroducing all synchronization events and by modeling
their execution behavior.
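For barriers, the reintroduction step above amounts to taking the slowest thread in each inter-barrier interval; the sketch below shows only this barrier case with made-up per-segment times (RPPM also models critical sections and producer-consumer relationships, which are not captured here):

```python
def predict_parallel_time(per_thread_segments):
    """Barrier-only sketch of synchronization modeling:
    per_thread_segments[t][i] is the predicted execution time of
    thread t between barriers i and i+1; at each barrier, every
    thread waits for the slowest one."""
    num_intervals = len(per_thread_segments[0])
    return sum(max(thread[i] for thread in per_thread_segments)
               for i in range(num_intervals))

# Two threads, three barrier intervals (made-up per-segment times):
print(predict_parallel_time([[10, 20, 5],
                             [12, 18, 9]]))  # 12 + 20 + 9 = 41
```

A naive extension that ignores synchronization would instead take the maximum of the per-thread totals (38 vs. 39 here), underestimating the time lost waiting at each barrier.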
RPPM predicts the execution time of all Parsec and Rodinia applications with an average absolute error of 11%. This significantly outperforms a naive extension of the single-threaded model that does not model any synchronization, which leads to an average error of 28%. We also show that RPPM is a valuable tool to speed up design space exploration and to gain insight into the execution behavior of multi-threaded applications on future designs.