Approximation of Worst-Case Execution Time for Preemptive Multitasking Systems
Matteo Corti1, Roberto Brega2, and Thomas Gross1
1 Departement Informatik, ETH Zürich, CH-8092 Zürich
2 Institute of Robotics, ETH Zürich, CH-8092 Zürich
Abstract. The control system of many complex mechatronic products requires for each task the Worst Case Execution Time (WCET), which is needed for the scheduler's admission tests and subsequently limits a task's execution time during operation. If a task exceeds the WCET, this situation is detected and either a handler is invoked or control is transferred to a human operator. Such control systems usually support preemptive multitasking, and if an object-oriented programming language (e.g., Java, C++, Oberon) is used, then the system may also provide dynamic loading and unloading of software components (modules). Only modern, state-of-the-art microprocessors can provide the necessary compute cycles, but this combination of features (preemption, dynamic un/loading of modules, advanced processors) creates unique challenges when estimating the WCET. Preemption makes it difficult to take the state of the caches and pipelines into account when determining the WCET, yet for modern processors, a WCET based on worst-case assumptions about caches and pipelines is too large to be useful, especially for big and complex real-time products. Since modules can be loaded and unloaded, each task must be analyzed in isolation, without explicit reference to other tasks that may execute concurrently. To obtain a realistic estimate of a task's execution time, we use static analysis of the source code combined with information about the task's runtime behavior. Runtime information is gathered by the performance monitor that is included in the processor's hardware implementation. Our predictor is able to compute a good estimation of the WCET even for complex tasks that contain a lot of dynamic cache usage, and its requirements are met by today's performance monitoring hardware. The paper includes data to evaluate the effectiveness of the proposed technique for a number of robotics control kernels that are written in an object-oriented programming language and execute on a PowerPC 604e-based system.

Introduction
Many complex mechatronic systems require sophisticated control systems that are implemented in software. The core of such a system is a real-time scheduler that determines which task is granted access to a processor. In addition, such systems may include a (graphical) user interface, support for module linking (so that a family of robots can be handled), a facility to dynamically load or unload modules (to allow upgrades in the field), and maybe even a network interface (to retrieve software upgrades or allow remote diagnostics). Such systems are not autonomous, i.e., there exists a well-defined response to unexpected events or breakdowns, and such a response may involve a human operator.
The core system shares many features with a real-time system, and as in a real-time system, each task has a computational deadline associated with it that must be met by the scheduler. To allow admission testing, the scheduler needs to know the Worst Case Execution Time (WCET) of each task. This WCET, or the maximum duration of a task, is of great importance for the construction and validation of real-time systems. Various scheduling techniques have been proposed to enforce the guarantee that tasks finish before their deadlines, but these scheduling algorithms generally require that the WCET of each real-time task in the system is known a priori [4].
Modern processors include several features that make it difficult to compute the WCET at compile time, such as out-of-order execution, multiple levels of caches, pipelining, superscalarity, and dynamic branch prediction. However, these features contribute significantly to the performance of a modern processor, so ignoring them is not an option. If we always make worst-case assumptions on the execution time of an instruction, then we overestimate the predicted WCET by an order of magnitude and consequently obtain only poor usage of the processor resources.
The presence of preemptive multitasking in the real-time system worsens this overestimation because the effects of a context switch on caches and pipelines cannot be easily foreseen. The difficulty of analytically modeling all aspects of program execution on a modern microprocessor becomes evident if we consider the requirements of controllers for modern mechatronic systems. In addition to real-time tasks, the controller must manage a dynamic set of other processes to handle computations that are not time-critical as well as input/output, including network and user interaction.
In such a dynamically changing system, the constellation of real-time and other tasks cannot be predicted. Since all these tasks occupy the same host together, the effects of preemption are difficult, if not impossible, to predict, and a purely analytical off-line solution to the problem is not practical.
Our approach to this problem consists of two parts: We determine an approximation of the WCET for each task. Then the control system decides, based on the WCET, if the task with its deadline can be admitted, and if admitted, monitors its behavior at runtime. Since the WCET is only an estimate (although a good one, as we demonstrate in this paper), the scheduler checks at runtime if a task exceeds its estimated WCET. A basic assumption is that the system can safely handle such violations: The scheduler attempts to meet all deadlines, but if during execution a deadline is missed because a task takes longer than estimated, then there exists a safe way to handle this situation. This assumption is met by many industrial robot applications. In the absence of a well-founded WCET estimate, the system must be prepared to deal with the developer's guess of the WCET.
To determine a realistic estimate of the WCET, our toolset integrates source code analysis with statistical dynamic performance data, which are gathered from a set of representative program executions. To collect runtime information, we use a processor-embedded performance monitor. Such monitors are now present in many popular architectures (e.g., Motorola PowerPC, Compaq Alpha, and Intel Pentium).
Although this approach does not establish a strong bound on the execution time (cost-effective performance monitors provide only statistical information about the execution time), we nevertheless obtain an estimate that is sufficiently accurate to be of use to a real scheduler.

Problem Statement
The objective is to develop a tool to determine a realistic approximation of the maximum duration of a given task on a modern pipelined and superscalar processor. The system that executes the task is controlled by a real-time operating system that includes preemptive multitasking. Yet an unknown and dynamic set of other processes that run concurrently with the real-time tasks prevents an off-line a priori scheduling analysis. The state of the processor and the caches is unknown after a context switch. We assume that the execution platform (or an equivalent system) as well as representative input vectors are available for measurements.
The computation of the approximate WCET can involve the compiler and, if necessary, the user, but the structure of the tool must be simple and suitable for use by application experts, not computer scientists. Therefore, interaction with the tool by the program developer should be minimized.
To support modern program development techniques, we want to be able to handle programs written in object-oriented languages. Such programs have a number of properties that make integration into a real-time environment difficult, but the use of such a language is a requirement imposed by the application developers [2]. Since the computation of a task's maximum duration is in general undecidable, we impose restrictions on the programs that are considered: For those loops and method invocations for which a compiler cannot determine the trip count or call chain, the user must provide a hint. (Recall that the real-time system checks at runtime if a task exceeds its computed maximum duration; the consequences of incorrect hints are therefore detected.) Furthermore, we do not permit the use of the new construct in a real-time task, since this operation is not bounded in time or space. These restrictions are accepted by our user community, which wants to take advantage of the benefits of an object-oriented programming style but can comfortably live without storage allocation in a real-time controller.
Our paper presents a solution to this problem that handles programs written in Oberon-2 [24], a Pascal-like language with extensions for object orientation. The solution presented addresses the problems that are faced by users of similar languages, e.g., Java.

System Overview
The tool for the WCET approximation is implemented as part of the XOberon preemptive real-time operating system and compiler developed at the Institute of Robotics, Swiss Federal Institute of Technology (ETH Zürich) [2, 6]. The execution platform is a set of Motorola VME boards with a PowerPC 604e processor [12]. We summarize here a few aspects of this platform. XOberon is loosely based on the Oberon System [33] and includes a number of features to support the rapid development of complex real-time applications; e.g., the system is highly modular (so different configurations can be created for specific development scenarios), and the system recognizes module interfaces that are checked for interface compatibility by the dynamic loader.
XOberon includes a deadline-driven scheduler with admission testing. The user must provide the duration and the deadline of a task when submitting it to the system. The real-time scheduler preallocates processor time as specified by the duration/deadline ratio. If the sum of all these ratios remains under 1.0, the scheduler accepts the task; otherwise the task is rejected. The scheduler is also responsible for dispatching non-real-time tasks, which are brought to the foreground only when no real-time task is pending or waiting for dispatch.
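The admission test can be summarized by a short sketch (shown in Python for compactness; the names Task and admit and the task-set representation are illustrative assumptions, not the actual XOberon interface):

    from dataclasses import dataclass

    @dataclass
    class Task:
        duration: float   # estimated WCET, e.g., in microseconds
        deadline: float   # relative deadline in the same unit

    def admit(running: list[Task], candidate: Task) -> bool:
        """Accept the candidate only if the sum of all duration/deadline
        ratios, including the candidate's, remains under 1.0."""
        load = sum(t.duration / t.deadline for t in running)
        return load + candidate.duration / candidate.deadline < 1.0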
The system handles non-compliance by stopping any task that exceeds its time allotment. Such a task is removed from the scheduling pool, and the system invokes an appropriate exception handler or informs the operator. This behavior is acceptable for applications that are not life-critical and that allow stopping the complete system. The use of interrupts is deprecated, and the system and applications predominantly use polling to avoid unexpected interruptions during a task's execution.
WCET prediction proceeds as follows: The source of a real-time task is first compiled, and the resulting object code is run on the target platform with some training input. The execution of this real-time task is monitored in the actual system with the same set of processes that will be present in the final system. Thereafter, our tool uses the statistics from the training runs, as well as the source code, to approximate the WCET.

Related Work
The importance of WCET computation has grown rapidly in the past years and has attracted many research groups [30]. These efforts differ in the programming language that is supported, the restrictions that are imposed on programs admitted for real-time scheduling, the (real-time) runtime system that controls the execution, and the hardware platforms that are used. We first discuss programming language aspects and then consider different approaches to deal with modern processors.
Kligerman and Stoyenko [15], as well as Puschner and Koza [28], provide a detailed description of the constraints that must be obeyed by real-time tasks. The requirements include bounds on all loops, bounded depth of recursive procedure calls, and the absence of dynamic structures. To enforce these restrictions they introduce special language constructs into the programming language, Real-Time Euclid and MARS-C, respectively. Other groups use separate and external annotations and thereby decouple the programming language from the timing specifications [23, 27].
We choose to embed information about the real-time restrictions (i.e., the maximum number of loop iterations and the cycle counts of any input/output) in the program source and to integrate our predictor with the compiler. This approach gives the predictor perfect knowledge of the mapping between the source code and the object code produced, even when some simple optimizations take place.
To avoid a large number of annotations in the source file, we perform a loop termination analysis that computes the number of iterations for many loops. This compiler analysis phase simplifies the work of the user and reduces the need for specific tools for constraint specification [16]. The implementation of this analysis is greatly eased by the absence of pointer arithmetic and by the strong typing embodied in the programming language (Oberon-2 in our system, but the same holds true for other modern languages, e.g., Java).
The second aspect of computing the WCET is dealing with the architecture of the execution platform. The first approaches in this field were based on simple machines without caches or pipelines. Such machine models allow a precise prediction of the instruction lengths, because the execution time of a given instruction is always known. However, the rising complexity of processor implementations and memory hierarchies introduces new problems to deal with: the execution time of instructions is no longer constant; instead, it depends on the state of the pipelines and caches. Several research groups have investigated worst-case execution time analysis for modern processors and architectures.
For fixed pipelines, the CPU pipeline analysis is relatively easy to perform because the various dependences are data-independent. Several research groups combine the pipeline status with the data-flow analysis to determine the WCET [8, 19, 22, 34]; this approach can be extended to the analysis of the instruction cache, since the behavior of this unit is also independent of the data values [8, 19, 32].
Data caches are a second element of modern architectures that can severely influence the execution time of load and store instructions. To precisely analyze the behavior of the data caches, we need information about which addresses are accessed by memory references. If we treat each access as a cache miss, then the predicted time is tremendously overestimated (given the difference between the access times of cache and memory). We can attempt to reduce this overestimation with data-flow analysis techniques [13, 19] relying on compiler or hardware control. The reported results depend highly on the examined program and on the number of dynamic memory accesses that are performed; as the program's complexity grows, the predicted and actual execution times differ significantly.
Another approach to the data cache problem focuses on active management of the cache. If we can prevent the invalidation of the cache content, then the execution time is predictable. One option is to partition the cache [14, 20] so that one or more partitions of the cache are dedicated to each real-time task. This approach helps the analysis of the cache access time since the influence of multitasking is eliminated, but on the other hand, partitioning limits the system's performance [21].
Techniques exist to take into account the effects of fixed-priority preemptive scheduling on caches [17, 3], but the nature and flexibility of the process management of XOberon (see Sect. 3) prevent an off-line analysis with acceptable results.
All these approaches try to reduce the WCET solely by static analysis. The results vary greatly depending on the examined program: If the number of dynamic memory operations is high, such as in linear algebra computations that involve matrices, the prediction may significantly exceed the real maximum duration.
Good results can be obtained for straight-line code [31] (i.e., code without loops and recursive procedure calls), but problems arise when, as in our case, the system provides preemptive multitasking and does not enforce special restrictions on the structure of real-time tasks. The program flow can be interrupted at any time, flushing the pipeline and forcing caches to be refilled. Such dynamic behavior cannot be handled by pure static analysis, because the resulting worst-case execution time would be too large to have any practical relevance. Out-of-order execution and branch prediction are other characteristics that increase the number of variables that change dynamically depending on the input data.
The novelty of our approach consists in the use of a runtime performance monitor to overcome the lack of information caused by preemption and by the dynamic nature of the XOberon system. We monitor the use of pipelines and caches at runtime, and we feed these data into the predictor to approximate the cycle counts of each instruction. This information is then combined with the insight obtained from static analysis to generate the final object code for the real-time task. Multitasking effects, like cache invalidations and pipeline flushes, are accounted for by the gathered data. This approach not only minimizes the impact of preemption but also helps us deal with dynamic caching, out-of-order execution, and speculative execution in a simple, integrated, and effective way.

Static Analysis of the Longest Path
To determine the WCET of a task, we compute the longest path on a directed and weighted graph representation of the program's control flow. The nodes of the graph represent the basic blocks, while the edges represent the feasible paths. Each edge is weighted with the length (number of cycles) of the preceding basic block.
The WCET corresponds to the longest path in the graph. This path is obviously computable only if the maximum number of iterations of each loop is known and there are no direct or indirect recursive method invocations. This technique results in an overestimation of the WCET, since mutually exclusive control-flow branches are ignored. The false-path problem is in general undecidable, but heuristics to deal with it have been proposed [1].
To keep the user involvement in computing the WCET as low as possible, the WCET estimator analyzes the source code, trying to automatically bound the maximum number of loop iterations. The user needs only to specify the cost (in number of cycles) of the memory-mapped input/output operations, as well as the loop bound for those loops for which the tool is unable to compute a limit on the number of iterations. This situation can occur when the loop termination condition depends on input values. For the majority of the programs, however, the compilation and analysis pass is successful in determining the maximum number of iterations.
Although our analysis is simple (many compilers employ more sophisticated techniques to identify induction variables [25]), we obtain good results because the majority of the loops in real-time software have a fixed length and a for-like structure. In our suite of test programs (see Sect. 8) only a few loops needed a hint in the form of a bound specification.
Our tool reports loops that defy analysis; for those loops, a hint is required specifying the maximal number of iterations. Such hints are syntactic comments, and the modified programs can still be compiled with any Oberon-2 compiler.
When the user manually specifies a loop bound, the compiler generates code to check at runtime that the hint is correct. If the loop bound is exceeded during the execution, an exception is raised and the program is halted, helping the developer to identify a misleading hint.
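Conceptually, the inserted check behaves like the following sketch: a counter is incremented on every iteration, and an exception is raised as soon as the user-supplied bound is exceeded. (The compiler emits equivalent inline Oberon-2 code; the Python form is only illustrative.)

    from typing import Callable

    class LoopBoundExceeded(Exception):
        """Raised when a loop iterates more often than its hint allows."""

    def run_with_bound(hint: int, condition: Callable[[], bool],
                       body: Callable[[], None]) -> None:
        """Illustrative model of the runtime check generated for a hinted loop."""
        iterations = 0
        while condition():
            iterations += 1
            if iterations > hint:
                raise LoopBoundExceeded(f"loop exceeded its declared bound of {hint}")
            body()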
Although our simple solution gives very good results, as the vast majority of the loops in the tested programs was automatically bounded, many improvements are possible. There exist algorithms to deal with loops with multiple exits; other researchers have developed solutions to compute the trip count for loops with multiple induction variables or to predict the execution time of loops where the number of iterations depends on counter variables of outer-level loops [9, 10].
Object orientation may cause problems when determining the WCET, because at compile time, the type of a given object and the dynamic cycle costs of its methods are not known. This problem is especially annoying when we consider the invocation of I/O routines, which involve the runtime system and are beyond the scope of the compiler. The XOberon system uses a database of objects, retrieved by name, to store the different drivers. We introduce a new construct to allow the programmer to insert a hint that expresses the cost of a driver method call (measured in processor cycles).
When each loop in the graph representation of the task has been bounded, we sum the cycle counts of all instructions in each basic block (details in Sect. 6) and record this sum as the weight of the outgoing edges of each node. To compute the longest path on the graph, this graph must be transformed into a directed acyclic graph (DAG) by unrolling the loops: The back-edge of the loop is removed, and the weights of the internal edges are adjusted to take the number of repetitions into account. This loop unrolling is done only for this computation and has no effect on the generated code.
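A minimal sketch of the longest-path computation on the unrolled graph follows. After the conceptual loop unrolling, the graph is acyclic, so the most expensive path can be found with a single memoized traversal; the graph representation used here is an assumption for illustration only.

    def longest_path(succ: dict[str, list[tuple[str, float]]], entry: str) -> float:
        """Longest (most expensive) path, in cycles, from 'entry' through a DAG
        of basic blocks. succ maps a block to (successor, weight) pairs, where
        the weight is the cycle count of the source block, already scaled by
        its trip count (the conceptual loop unrolling described above)."""
        memo: dict[str, float] = {}

        def cost(block: str) -> float:
            if block not in memo:
                memo[block] = max(
                    (w + cost(nxt) for nxt, w in succ.get(block, [])),
                    default=0.0,
                )
            return memo[block]

        return cost(entry)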
Instruction Length Computation

Once the number of iterations of each basic block is known, the dynamic length of each instruction in the block is computed. Since the system structure obscures a precise knowledge of the cache and pipeline states, the length of each instruction is approximated using run-time information gathered with a hardware performance monitor. This mapping of monitor information into execution costs is described in the following sections.

Runtime Performance Monitor
The Motorola PowerPC 604e microprocessor provides hardware assists to monitor and count predefined events such as cache misses, mispredicted branches, and issued instructions [12, 18]. Because the scheduler switches a processor's execution among multiple processes, and because statistics about a particular process may be of interest, a process can be marked for runtime profiling. This feature helps us to gather data only on the task of interest and avoids interference from other processes.
The performance monitor uses four special registers to keep statistics on the specified events. These registers are read each time the scheduler is activated, to avoid additional interrupts. The overhead of the performance monitor routine in the scheduler is small (approximately 30 processor cycles). Monitoring is under user control and therefore does not affect system performance unless the user wants to gather performance data.
The performance monitor was not specifically designed to help in program analysis but to better understand processor performance. Event counting is not precise, and events are not correctly attributed to the corresponding instructions; this is a common problem for out-of-order execution architectures [5], and the monitor of the PowerPC 604e exhibits this problem. For a given event, we can therefore keep just summary data over a large number of instruction executions. This information is sufficient to evaluate how the processor performs but is too coarse to precisely understand the correlation between a particular event and the instruction generating it.
Many events are not disjoint, and the performance monitor does not report their intersection. Consider, e.g., stalls in the dispatch unit: Seven different stall types are reported, but there is no way to know how many apply to the same instruction. Other problems arise when trying to monitor the number of stalls in the different execution units. The performance monitor reports the number of instruction dependences present in the unit's corresponding reservation station. Unfortunately, a dependence one cycle before the execution does not necessarily imply a stall, because the dependence could be resolved in the meantime. In addition, the PowerPC 604e has split the branch prediction unit (BPU) included in the design of its predecessor (the PowerPC 604) into two such units and includes a condition register unit (CRU); see Fig. 1. However, the performance monitor events were not updated consistently: The CRU shares the dispatch unit with the BPU, and these two units are treated as a single unit by the performance monitor.

Fig. 1. 604e pipeline structure.
Our WCET approximator deals with these inaccuracies of the performance monitor and tries to extrapolate useful data from the information delivered by the hardware monitor to approximate the time it takes to execute each basic block.
The performance monitor is able to deliver information about the cache misses (miss_load, miss_store), the penalty for a cache miss (penalty_load), the idle and busy cycles for each execution unit (idle_unit, exec_unit), the load of each execution unit (load_unit), the dependences in the execution units' reservation stations (p_dep), and the stalls in the dispatch unit (p_event).
The length of each basic block (in cycles) is computed by summing up the lengths of the instructions in the block. The length of each block is then multiplied by the number of times the block is executed (see Sect. 5). We now discuss how the instruction lengths are approximated with the help of run-time information.

CPI Metric
The simplest way to deal with processor performance is to use the popular instructions-per-cycle (IPC) metric, but IPC is a poor metric for our scenario because it does not provide any intuition about which parts contribute to (the lack of) performance. The inverse of the IPC metric is the cycles-per-instruction (CPI) metric, i.e., the mean dynamic instruction length. The advantage of the CPI metric over the IPC metric is that the former can be divided into its major components to achieve a better granularity [7].
The total CPI for a given architecture is the sum of the infinite-cache performance and the finite-cache effect (FCE). The infinite-cache performance is the CPI for the core processor under the assumption that no cache misses occur, and the FCE accounts for the effects of the memory hierarchy:

FCE = (cycles per miss) · (misses per instruction)

The misses-per-instruction component is commonly called the miss rate, and the cycles-per-miss component is called the miss penalty. The structure of the PowerPC performance monitor forces us to include only the data cache influences in the FCE; the instruction cache miss penalty is considered as part of the instruction unit fetch stall cycles (see Sect. 6.4).
The FCE separation is not enough to obtain a good grasp of the processor usage by the different instructions, and we further divide the infinite-cache performance into busy and stall time:

CPI_infinite = busy + stall

where the stall component corresponds to the time the instruction is blocked in the pipeline due to some hazard. The PowerPC processor has a superscalar pipeline and may issue multiple instructions of different types simultaneously. For this reason we differentiate between the stall and busy components for each execution unit; let parallelism be the number of instructions that are simultaneously in the respective execution units, stall_unit be the number of cycles the instruction stalls in the execution unit, stall_pipeline be the number of cycles the instruction stalls in the other pipeline stages, and exec_unit the number of cycles spent in an execution unit. The dynamic length of an instruction is then approximated as

length = (exec_unit + stall_unit) / parallelism + stall_pipeline + FCE    (3)

For each instruction of a basic block we use (3) to approximate the instruction's dynamic length. The difficulty lies in obtaining a correct mapping between the static description of the processor and the dynamic information gathered with the runtime performance monitor to compute the components of (3). The details are explained in the following paragraphs.
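A trivial helper can evaluate (3) once its components are known; the function below is only an illustrative restatement of the formula, and the component values themselves are derived in the following subsections.

    def instruction_length(exec_unit: float, stall_unit: float, parallelism: float,
                           stall_pipeline: float, fce: float) -> float:
        """Dynamic instruction length in cycles, following (3):
        (exec_unit + stall_unit) / parallelism + stall_pipeline + FCE.
        fce is non-zero only for instructions handled by the load/store unit."""
        return (exec_unit + stall_unit) / parallelism + stall_pipeline + fce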
Finite Cache Effect

The PowerPC 604e performance monitor delivers precise information about the mean number of load and store misses and about the length of a load miss. Because of the similarity of the load and store operations, we can work with the same miss penalty for both [12] and compute the FCE as follows:

FCE = (miss_load + miss_store) · penalty_load
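As a purely illustrative example (the numbers are invented, not measured): with a load miss rate of 1%, a store miss rate of 0.5%, and a 20-cycle load-miss penalty, the formula yields FCE = (0.01 + 0.005) · 20 = 0.3 cycles per instruction.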
Pipeline and Unit Stalls
Unfortunately, the performance monitor does not provide precise information about the stalls in the different execution units and in the other pipeline stages. The performance monitor was not designed to be used for program analysis, and the predictor must map the gathered data to something useful for the WCET computation.
The first problem arises in the computation of the mean stall time in the different execution units. As previously stated, the performance monitor delivers the mean number of instruction dependences that occur in the reservation stations of the execution units. This number is not quite useful, because a dependence does not necessarily cause a stall, and we must find other ways to compute or approximate the impact of stalls.
The mean stall time for instructions executed by a single-cycle execution unit is easily computable as the difference between the time an instruction remains in the execution unit and its normal execution length (i.e., 1). Let idle_unit be the average idle time per cycle (for an execution unit), and load_unit the average number of instructions handled per cycle (at most 1):

stall_unit = (1 − idle_unit) / load_unit − 1    (5)
Unfortunately, the execution time of an instruction in a multiple-cycle execution unit is not constant, and (5) cannot be used. Thanks to Little's theorem applied to M/M/1 queues, we can compute the mean length of the reservation station queue N_q (at most 2) and then approximate the mean stall time stall_unit as follows (let ρ = 1 − idle_unit be the unit's utilization factor, i.e., the arrival/service rate ratio, and p_dep the reported number (probability) of dependences):

(1 − stall_unit · load_unit)^N_q = 1 − p_dep    (7)

We assume that an absence of reported dependences corresponds to the absence of stalls in the execution unit (7).
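The sketch below shows one way to turn the monitored quantities into stall_unit values: the single-cycle case follows (5) directly, and the multiple-cycle case inverts (7). The use of the M/M/1 mean queue length for N_q, capped at 2, is an assumption consistent with the text rather than a verified reproduction of the original implementation.

    def stall_single_cycle(idle_unit: float, load_unit: float) -> float:
        """Mean stall time for a single-cycle execution unit, following (5)."""
        return (1.0 - idle_unit) / load_unit - 1.0

    def stall_multi_cycle(idle_unit: float, load_unit: float, p_dep: float) -> float:
        """Mean stall time for a multiple-cycle execution unit, by inverting (7).
        rho is the unit utilization; N_q is approximated with the M/M/1 mean
        queue length and capped at 2 (assumed reading of the text)."""
        rho = 1.0 - idle_unit
        n_q = min(2.0, rho * rho / (1.0 - rho)) if rho < 1.0 else 2.0
        if n_q <= 0.0 or p_dep <= 0.0:
            return 0.0  # no dependences reported: assume no stalls
        # (1 - stall*load)^n_q = 1 - p_dep  =>  stall = (1 - (1 - p_dep)**(1/n_q)) / load
        return (1.0 - (1.0 - p_dep) ** (1.0 / n_q)) / load_unit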
A similar problem is present in the computation of the mean number of stall cycles in the dispatch unit: Seven different types of stall are reported for the whole four-instruction queue. We have no way of knowing how many instructions generated stalls, or how many different stalls are caused by the same instruction. These numbers, however, are important for understanding the dynamic instruction lengths, because they include the effects of instruction cache misses and influence all the processed instructions. Several experiments with a set of well-understood test programs revealed that a reported stall can be considered as generated by a single instruction. With p_event the reported number of stalls of a given type and queue_size the length of the dispatch queue over which these instruction stalls are reported, we can compute the probability that an instruction stalls:

stall_dispatch = (Σ over all stall types p_event) / queue_size    (9)
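Under the single-instruction-per-stall assumption, (9) reduces to a sum over the seven reported event counters divided by the queue length, as in this sketch:

    def dispatch_stall(p_events: list[float], queue_size: int = 4) -> float:
        """Probability that an instruction stalls in the dispatch unit, following (9):
        each reported stall is attributed to a single instruction among the
        queue_size instructions of the dispatch queue."""
        return sum(p_events) / queue_size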
The stalls in the dispatch unit (stall_dispatch) also include the stalls in the fetch stage and can be substituted for stall_pipeline in (3).

Execution Length
The normal execution length (i.e., the execution time without stalls) of an instruction for single-cycle units is known from the processor specifications, but the multiple-cycle units (FPU, MCIU, and LSU: Floating Point Unit, Multiple Cycle Integer Unit, and Load Store Unit) have internal pipelines that cause different throughput values depending on the instruction type. As a consequence, there is no way to know the length of a given instruction, even if we do not consider the caches and pipeline stalls. The presence of preemption, and consequently the lack of knowledge about the pipeline status, forces us to approximate this value.
The architecture specifies the time to execute an instruction (latency) and the rate at which pipelined instructions are processed (throughput). We consider an instruction pipelined, in the execution unit, if there is an instruction of the same type among the last d_unit instructions (exec_unit = throughput); otherwise its length corresponds to the integral computation time (exec_unit = latency). The value of d_unit is computed experimentally and varies, depending on the unit, from four to eight instructions.
Since our instruction-length analysis is done on a basic-block basis, the approximation for the first instruction of a given type in a block is computed differently:

p_instruction = 1 − (1 − load_unit)^d_unit

where p_instruction is the probability that an instruction of the given type is present among the last d_unit instructions.
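The sketch below models this choice for one instruction. The text does not spell out how p_instruction is used for the first instruction of its type in a block, so the expected-value weighting of throughput and latency shown here is an assumption.

    from typing import Optional

    def exec_unit_length(latency: float, throughput: float,
                         same_type_in_window: Optional[bool],
                         load_unit: float, d_unit: int) -> float:
        """Approximate exec_unit for one instruction: if an instruction of the
        same type occurred among the last d_unit instructions, use the
        throughput, otherwise the latency. For the first instruction of its
        type in a block this is unknown, so the two cases are weighted by
        p_instruction = 1 - (1 - load_unit)**d_unit (assumed combination)."""
        if same_type_in_window is True:
            return throughput
        if same_type_in_window is False:
            return latency
        p_instruction = 1.0 - (1.0 - load_unit) ** d_unit
        return p_instruction * throughput + (1.0 - p_instruction) * latency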
To check how this approximation works in practice and to refine the d_unit and threshold values, we compared, for some test programs, known mean instruction lengths (obtained with the runtime performance monitor) with the predicted ones, confirming the soundness of the method.

Instruction Parallelism
We compute the mean instruction parallelism (parallelism) by considering mean values over the whole task execution, since it is impossible to obtain finer-grained information. Furthermore, the task cannot be interrupted to read the performance monitor; otherwise the interference with the task's flow would destroy the validity of the gathered data. The FCE is taken into account only for the instructions handled by the load/store unit (util_LSU). Equation (3) can be used again with the mean execution time (exec) and the mean CPI:

parallelism = (exec + stall) / (CPI − stall_dispatch − util_LSU · FCE)    (12)
To compute the mean instruction execution length exec (needed by (12)), the mean stall values computed earlier can be used. We first compute the mean execution time of the instructions handled by a given execution unit (exec_unit), in the same way as the stalls are approximated for the single-cycle units. Finally, we compute the total mean instruction length exec as the weighted mean over all instruction classes:

exec = Σ over all units (util_unit · exec_unit)    (14)
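Putting the task-wide means together, (12) and (14) can be evaluated as in the sketch below; the per-unit means (util_unit, exec_unit) and the mean stall are assumed to have been derived as described above.

    def mean_exec(util: dict[str, float], exec_unit: dict[str, float]) -> float:
        """Weighted mean instruction execution length, following (14)."""
        return sum(util[u] * exec_unit[u] for u in util)

    def mean_parallelism(cpi: float, exec_mean: float, stall_mean: float,
                         stall_dispatch: float, util_lsu: float, fce: float) -> float:
        """Mean instruction parallelism, following (12)."""
        return (exec_mean + stall_mean) / (cpi - stall_dispatch - util_lsu * fce)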
We now have all the elements (exec_unit, stall_unit, parallelism, stall_pipeline, and FCE) needed to compute the length of each instruction of a basic block by applying (3).

Discussion
We presented an overview of the approximations used to integrate the information obtained from the runtime performance monitor into the computation of the dynamic instruction length. The use of probabilistic information and queueing theory does not guarantee exact results, but the presence of preemption naturally prevents a precise analysis, so an approximation is acceptable in this environment. The approximations described here were refined by checking the results of several experimental test programs that are designed to stress different peculiarities of the processor.
This approach could be improved by using hardware better suited for runtime monitoring, but as the results (Sect. 8) indicate, the PowerPC 604e performance monitor is sufficient to approximate the WCET with reasonable precision.
The following simple example shows how the gathered information is used to compute the length of the different instructions:
4 .LOOP: ldw r3,r4(r5) // the loop is executed 1000 times
Table 1. Sample data gathered by the runtime performance monitor
Table 1 shows some sample data returned by the performance monitor for this code snippet. The inability to insert breakpoints in the code and obtain fine-grained information with the performance monitor forces us to compute global mean values for the stalls and the parallelism. Using the formulae presented in Sect. 6, we are then able to approximate all the missing values (stall_unit, parallelism, and stall_pipeline).
In the same way we compute stall_unit for each execution unit (see Table 2), getting a parallelism of 2.88 and a pipeline stall time of 0.09 cycles (the performance monitor reports an overall CPI of 0.5 and the effect of the data cache as 0.001 cycles per instruction).

Table 2. Stalls and mean execution times
parallelism = (exec + stall) / (CPI − stall_dispatch − util_LSU · FCE) = 2.88
These values are used for both addi instructions (lines 3 and 5), since we are not able to precisely assign the stalls to the correct instruction. We estimate the existence of a stall for the first addic instruction (line 3), but since this instruction is executed only once, the effect on the global WCET is not significant.
This small example shows how we assign stalls to the different instructions and how the precision of the performance monitor influences the results: The smaller the profiled code segments, the more accurate the WCET estimation. The global homogeneity and simplicity of real-time tasks permit the computation of a good approximation of the WCET even with coarse statistical data, but better hardware support would surely help to allocate these cycles to the correct instructions.
Testing the soundness of a WCET predictor is a tricky issue. To compare the computed value to a measured maximum execution time, we must force execution of the longest path at runtime. Such control of the execution may be difficult to achieve, especially for big applications, where it may be impossible to find the WCET by hand. Consequently, the WCET that we can experimentally measure is an estimate and could be smaller than the real value.
We generated a series of short representative tests (each with a unique execution path) to stress different characteristics of the processor and to check the approximations made by the predictor. These tests provide a first feedback on the soundness of the predictor, and they are useful during the development of the tool, since the results and possible errors can be easily understood. These first tests include integer and floating-point matrix multiplications, the search for the maximum value in an array, solving differential equations with the Runge-Kutta method, polynomial evaluation with Horner's rule, and distribution counting. Table 3 summarizes the results we obtained when approximating the WCET of these tasks. We computed the predicted time using data gathered in the same environment used for the measured-time tests, i.e., the standard XOberon system task set.

Table 3. Results for simple test cases.
Measured time Predicted time Prediction error
For the majority of the tests we achieved a prediction within 10% of the real value, and we avoided significant underestimations of the WCET.
The imprecision of some results is due to several factors. The biggest contributor is the inaccuracy of the runtime performance monitor, which clearly was not designed to meet our needs. The inability of the performance monitor to provide data on a finer-grained basis, e.g., per basic block, forces us to use mean values. However, the combination of averages does not capture all aspects of the behavior of the complete program execution.
The importance of reducing the WCET by approximating some of the instruction-length components with run-time information is shown in Table 4. The table shows the results obtained with conservative worst-case assumptions for the memory accesses: The obtained WCET is too high and unrealistic to be used in practice, especially considering that many other aspects, such as the pipeline usage, should also be considered.
These benchmark tests do not say much about the usability of our predictor for real applications, because the code of these tests is quite homogeneous. Therefore, we profiled several real-time tasks from different real applications together with a standard set of non-real-time processes handling user interaction and network communication.

Table 4. Importance of the approximations
Matrix multiplications Array maximum Polynomial evaluation
Testing big applications is clearly more difficult, since the longest path is often unknown and the maximum execution time must be forced by manipulating the program's input values.
Trigonometric functions are a good example of the problem: The predictor computes the longest possible execution, while, thanks to optimized algorithms, different lengths are possible at runtime. It is obviously difficult, or often even impossible, to compute an input set such that each call to these functions takes the maximum amount of time. We must therefore take into account that the experimentally measured maximum length is probably smaller than the theoretical maximum execution time. On the other hand, tests with big applications are good for demonstrating the usability of the tool in the real world.
In principle, these real-life programs could be tested using a "black box" approach with the execution time as the result. Testing then consists of maximizing the result function, i.e., the dynamic path length, by varying the input data and submission time [29, 26]. Unfortunately, this approach is not feasible with real applications, where the amount of input data is huge and some execution paths may damage the hardware (robot) or the operator. We are therefore required to exercise some discretion in controlling the execution of the application kernels. Tables 5 to 7 report the maximum measured execution time when trying to force the longest path and the predicted WCET approximation.
The first application tested is the control software of the LaserPointer, a laboratory machine that moves a laser pen mounted on the tool-center point of a two-joint (2 DOF) manipulator (Table 5).

Table 5. Results for LaserPointer.
The second application tested is the control software of the Hexaglide, a parallel manipulator with 6 DOF used as a high-speed milling machine [11]. The two real-time tasks contain a good mix of conditional and computational code (Table 6).
The Robojet is a hydraulically actuated manipulator used in the construction of tunnels. This machine sprays liquid concrete on the walls of new tunnels using a jet as its tool. We profiled three periodic real-time tasks that compute the different trajectories and include complex trigonometric and linear algebra computations (Table 7).

Table 6. Results for Hexaglide.

Table 7. Results for Robojet.
For the majority of the benchmark tests we get good results, with an imprecision on the order of 10 percent. The controller handler of the LaserPointer and the dynamics handler of the Hexaglide introduce some imprecision, due to the large number of trigonometric functions present in their code; different executions of these functions have vastly varying execution times, and this property complicates the measurement of the real worst case.
In summary, for these real applications, our technique produces conservative approximations yet avoids noticeable underestimations of the WCET. These results demonstrate the soundness of the method for real-world products. If a user feels that the benchmark tests reflect the properties of future applications more than the three kernels presented here, then such a user can add a margin (of, say, 10%) to the approximated WCET.
Providing hints to compute the WCET for these tasks requires little effort. Only a few loops require hints to specify the bound on the trip count, and one simple recursive function was rewritten as a loop construct (see Table 8, which also lists the number of source lines). For example, to profile the Hexaglide kernel, we only needed to specify two maximal loop-iteration bounds and measure the length of four driver calls.
Table 8. Source code changes.
Application  Hints  Method length  Code size
LaserPointer
Concluding Remarks
This paper describes an approach to compute a realistic value for the worst-case execution time of tasks that are executed on a system with preemptive multitasking and a dynamic constellation of running processes. Our approach combines compile-time analysis with statistical performance data gathered by the hardware assists found in a state-of-the-art microprocessor. When the compiler is unable to automatically determine the trip count of a loop, or the path length of a method invocation, the user must provide a hint. These hints are used to determine the longest path of the task, and the compiler inserts code to verify during execution in the real-time environment that the hints are accurate.
Our tool computes good approximations of the WCET of real-time processes of complex mechatronic applications in an unpredictable scheduling environment, avoiding the significant effort needed to obtain a similar value by trial-and-error experimentation or similar techniques. This solution is already employed in our robotics laboratory to compute the WCET of real products under development. Two features helped to gain user acceptance: this technique has minimal impact on the development time, and only minimal effort (occasional hints) is expected from the programmer.
Using a real processor (PowerPC 604e) for our experiments exposed us to shortcomings of its implementation: The hardware performance monitor provides only coarse-grained information, and the event model supported by these hardware components is not well suited to the analysis of programs written in a high-level language. Our tool must apply a number of approximations to map the information gathered by the hardware monitors into values that can be used to compute the WCET. Although we succeed in producing a usable value, better hardware assists would simplify this part of our tool. If advanced high-performance consumer processors are to play a role in real-time systems, further improvements in the monitoring hardware assists are essential.
The principal benefit of the techniques described here is that they allow us to compute a good WCET approximation of a given program for a system that uses preemptive scheduling on a set of processes unknown at analysis time and runs on a modern high-performance processor. Minimal user interaction makes this toolset suitable for application experts. This tool demonstrates that a few simple ideas, combined with a good model of the processor architecture, provide the basis of an effective system.

Acknowledgments
We would like to thank Richard Hüppi for supplying the sources of the LaserPointer controller software, and Marcel Honegger for supplying the sources of the Hexaglide and Robojet software.

References
[1] P. Altenbernd. On the false path problem in hard real-time programs. In Proc. 8th Euromicro Workshop on Real-Time Systems, pages 102–107, L'Aquila, Italy, June 1996.
[2] R. Brega. A real-time operating system designed for predictability and run-time safety. In Proc. 4th Int. Conf. Motion and Vibration Control (MOVIC), pages 379–384, Zurich, Switzerland, August 1998. ETH Zurich.
[3] J. Busquets-Mataix and A. Wellings. Adding instruction cache effect to schedulability analysis of preemptive real-time systems. In Proc. 8th Euromicro Workshop on Real-Time Systems, pages 271–276, L'Aquila, June 1996.
[4] C.-S. Cheng, J. Stankovic, and K. Ramamritham. Scheduling algorithms for hard real-time systems: A brief survey. In J. Stankovic and K. Ramamritham, editors, Tutorial on Hard Real-Time Systems, pages 150–173. IEEE Computer Society Press, 1988.
[5] J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In Proc. 30th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO-97), pages 292–302, Los Alamitos, CA, December 1997. IEEE Computer Society.
[6] D. Diez and S. Vestli. D'nia an object oriented real-time system. Real-Time Magazine.
[7] P. Emma. Understanding some simple processor-performance limits. IBM J. Research and Development, 43(3):215–231, 1997.
[8] C. Healy, R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding pipeline and instruction cache performance. IEEE Trans. Computers, 48(1):53–70, January 1999.
[9] C. Healy, M. Sjödin, V. Rustagi, and D. Whalley. Bounding loop iterations for timing analysis. In Proc. 4th Real-Time Technology and Applications Symp., pages 12–21, Denver, Colorado, June 1998.
[10] C. Healy and D. Whalley. Tighter timing predictions by automatic detection and exploitation of value-dependent constraints. In Proc. 5th Real-Time Technology and Applications Symp., pages 79–88, Vancouver, Canada, June 1999.
[11] M. Honegger, R. Brega, and G. Schweitzer. Application of a nonlinear adaptive controller to a 6 DOF parallel manipulator. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1930–1935, San Francisco, CA, April 2000. IEEE.
[12] IBM Microelectronic Division and Motorola Inc. PowerPC 604/604e RISC Microprocessor.
[13] S.-K. Kim, S. Min, and R. Ha. Efficient worst case timing analysis of data caching. In Proc. 2nd IEEE Real-Time Technology and Applications Symp., pages 230–240, Boston, MA, June 1996. IEEE.
[14] D. Kirk. SMART (Strategic Memory Allocation for Real-Time) cache design. In Proc. 10th IEEE Real-Time Systems Symp., pages 229–239, Santa Monica, California, December 1989. IEEE.
[15] E. Kligerman and A. Stoyenko. Real-time Euclid: A language for reliable real-time systems. IEEE Trans. on Software Eng., 12(9):941–949, September 1986.
[16] L. Ko, C. Healy, E. Ratliff, R. Arnold, D. Whalley, and M. Harmon. Supporting the specification and analysis of timing constraints. In Proc. 2nd IEEE Real-Time Technology and Applications Symp., pages 170–178, Boston, MA, June 1996. IEEE.
[17] C.-G. Lee, J. Hahn, Y.-M. Seo, S. Min, R. Ha, S. Hong, C. Park, M. Lee, and C. Kim. Analysis of cache-related preemption delay in fixed-priority preemptive scheduling. In Proc. 17th IEEE Real-Time Systems Symp., pages 264–274, Washington, D.C., December 1996. IEEE.
[18] F. Levine and C. Roth. A programmer's view of performance monitoring in the PowerPC microprocessor. IBM J. Research and Development, 41(3):345–356, May 1997.
[19] Y.-T. Li, S. Malik, and A. Wolfe. Cache modeling for real-time software: Beyond direct mapped instruction caches. In Proc. 17th IEEE Real-Time Systems Symp., pages 254–263, Washington, D.C., December 1996. IEEE.
[20] J. Liedtke, H. Härtig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In Proc. 3rd IEEE Real-Time Technology and Applications Symp., Montreal, Canada, June 1997. IEEE.
[21] S.-S. Lim, Y. Bae, G. Jang, B.-D. Rhee, S. Min, C. Park, H. Shin, K. Park, S.-M. Moon, and C.-S. Kim. An accurate worst case timing analysis for RISC processors. IEEE Trans. on Software Eng., 21(7):593–604, July 1995.
[22] S.-S. Lim, J. Han, J. Kim, and S. Min. A worst case timing analysis technique for multiple-issue machines. In Proc. 19th IEEE Real-Time Systems Symp., pages 334–345, Madrid, Spain, December 1998. IEEE.
[23] A. Mok and G. Liu. Efficient runtime monitoring of timing constraints. In Proc. 3rd Real-Time Technology and Applications Symp., Montreal, Canada, June 1997.
[24] H. Mössenböck and N. Wirth. The programming language Oberon-2. Structured Programming.
[25] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers.
[26] F. Müller and J. Wegener. A comparison of static analysis and evolutionary testing for the verification of timing constraints. In Proc. 19th Real-Time Technology and Applications Symp., pages 179–188, Madrid, Spain, June 1998. IEEE.
[27] C. Park. Predicting program execution times by analyzing static and dynamic program paths. Real-Time Systems, (5):31–62, 1993.
[28] P. Puschner and C. Koza. Calculating the maximum execution time of real-time programs. J. Real-Time Systems, 1(2):160–176, September 1989.
[29] P. Puschner and R. Nossal. Testing the results of static worst-case execution-time analysis. In Proc. 19th Real-Time Systems Symp., pages 134–143, Madrid, Spain, December 1998. IEEE.
[30] P. Puschner and A. Vrchoticky. Problems in static worst-case execution time analysis. In 9. ITG/GI-Fachtagung Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen, Kurzbeiträge und Toolbeschreibungen, pages 18–25, Freiberg, Germany, September 1997.
[31] F. Stappert and P. Altenbernd. Complete worst-case execution time analysis of straight-line hard real-time programs. J. Systems Architecture, 46(4):339–335, April 2000.
[32] H. Theiling and C. Ferdinand. Combining abstract interpretation and ILP for microarchitecture modelling and program path analysis. In Proc. 19th IEEE Real-Time Systems Symp., pages 144–153, Madrid, Spain, December 1998. IEEE.
[33] N. Wirth and J. Gutknecht. Project Oberon: The Design of an Operating System and Compiler. ACM Press, New York, 1992.
[34] N. Zhang, A. Burns, and M. Nicholson. Pipelined processors and worst case execution times. Real-Time Systems, 5(4):319–343, October 1993.

Glossary
CPI  Cycles per instruction: the total number of cycles needed to completely execute an instruction.
exec_unit  The time, in cycles, needed by the execution unit to handle the instruction.
FCE  Finite cache effect: accounts, in cycles, for the effects of the memory hierarchy.
idle_unit  The average execution unit idle time per cycle, see Sect. 6.4.
load_unit  The number of instructions handled by the corresponding unit per cycle.
miss_load  The percentage of cache load misses.
miss_store  The percentage of cache store misses.
N_q  The mean number of instructions waiting in the execution unit's reservation station.
p_dep  The reported number of dependences per cycle in a given reservation station.
p_event  The probability of a stall in the four-instruction dispatch queue, see (9).
parallelism  The number of instructions that are simultaneously in the corresponding execution units.
penalty_load  The average number of additional cycles needed to fetch the data from memory after a cache miss.
queue_size  The length of the dispatch queue, see (9).
stall_dispatch  The time, in cycles, a given instruction stalls in the dispatch unit, see (9).
stall_pipeline  The time, in cycles, a given instruction stalls outside the corresponding execution unit.
stall_unit  The time, in cycles, a given instruction stalls in the corresponding execution unit.
util_unit  The fraction of instructions handled by the corresponding unit, see (14).
ρ  The unit utilization factor or arrival/service ratio, see (6).