<html>

<head>

<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="GENERATOR" content="Microsoft FrontPage 6.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<link rel="stylesheet" type="text/css" href="../../../boost.css">

<title>The Boost Statechart Library - Performance</title>
</head>

<body link="#0000FF" vlink="#800080">
<table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
"header">
  <tr>
    <td valign="top" width="300">
      <h3><img alt="C++ Boost" src="../../../boost.png" border="0"
      width="277" height="86"></h3>
    </td>
    <td valign="top">
      <h1 align="center">The Boost Statechart Library</h1>
      <h2 align="center">Performance</h2>
    </td>
  </tr>
</table>

<hr>
<ul>
  <li><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
    tradeoffs</a></li>
  <li><a href="#MemoryManagementCustomization">Memory management
    customization</a></li>
  <li><a href="#RttiCustomization">RTTI customization</a></li>
  <li><a href="#ResourceUsage">Resource usage</a></li>
</ul>

<h2><a name="SpeedVersusScalabilityTradeoffs">Speed versus scalability
tradeoffs</a></h2>
<p>Quite a bit of effort has gone into making the library fast for small,
simple machines and scalable at the same time. (This applies only to
<code>state_machine&lt;&gt;</code>; there is still some room for optimizing
<code>fifo_scheduler&lt;&gt;</code>, especially for multi-threaded builds.)
While I believe it should perform reasonably in most applications, the
scalability does not come for free. Small, carefully handcrafted state
machines will thus easily outperform equivalent Boost.Statechart machines.
To get a picture of how big the gap is, I implemented a simple benchmark in
the BitMachine example. The Handcrafted example is a handcrafted variant of
the 1-bit-BitMachine implementing the same benchmark.</p>
<p>I tried to create a fair but somewhat unrealistic worst-case
scenario:</p>

<ul>
  <li>For both machines, exactly one object of the only event is allocated
    before starting the test. This same object is then sent to the machines
    over and over</li>
  <li>The Handcrafted machine employs GOF-visitor double dispatch. The
    states are preallocated so that event dispatch &amp; transition amount
    to nothing more than two virtual calls and one pointer assignment</li>
</ul>
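<p>The handcrafted scheme can be sketched in plain C++. The names below are
illustrative only (they are not taken from the Handcrafted example's
sources), but the cost structure is the one described above: one virtual
call into the event, one virtual call into the state, and a pointer
assignment for the transition:</p>

<pre>
#include &lt;cassert&gt;

struct Event;

struct State {                       // base class of all states
  virtual ~State() {}
  virtual State* reactToFlip(const Event&amp;) = 0;  // 2nd virtual call
};

struct Event {                       // base class of all events (the visitor)
  virtual ~Event() {}
  virtual State* dispatch(State&amp; s) const = 0;   // 1st virtual call
};

struct Flip : Event {
  State* dispatch(State&amp; s) const override { return s.reactToFlip(*this); }
};

struct On : State { State* reactToFlip(const Event&amp;) override; };
struct Off : State { State* reactToFlip(const Event&amp;) override; };

// Preallocated states: a transition is just a pointer swap.
static On onState;
static Off offState;
State* On::reactToFlip(const Event&amp;) { return &amp;offState; }
State* Off::reactToFlip(const Event&amp;) { return &amp;onState; }

int main() {
  const Flip flip;                   // one event object, reused over and over
  State* current = &amp;offState;
  for (int i = 0; i != 1000001; ++i)
    current = flip.dispatch(*current);   // the one pointer assignment
  assert(current == &amp;onState);     // an odd number of flips toggles the bit
  return 0;
}
</pre>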
<p>The benchmarks, running on a 3.2GHz Intel Pentium 4, produced the
following dispatch and transition times per event:</p>
<ul>
  <li>Handcrafted:
    <ul>
      <li>MSVC7.1: 10ns</li>
      <li>GCC3.4.2: 20ns</li>
      <li>Intel9.0: 20ns</li>
    </ul>
  </li>
  <li>1-bit-BitMachine with customized memory management:
    <ul>
      <li>MSVC7.1: 150ns</li>
      <li>GCC3.4.2: 320ns</li>
      <li>Intel9.0: 170ns</li>
    </ul>
  </li>
</ul>
<p>Although this is a big difference, I still think it will not be
noticeable in most real-world applications. No matter whether an
application uses handcrafted or Boost.Statechart machines, it will...</p>

<ul>
  <li>almost never run into a situation where a state machine is swamped
    with as many events as in the benchmarks. A state machine will almost
    always spend a good deal of time waiting for events (which typically
    come from a human operator, from machinery or from electronic devices
    over often comparatively slow I/O channels). Parsers are just about the
    only application of FSMs where this is not the case. However, parser
    FSMs are usually not directly specified on the state machine level but
    on a higher one that is better suited for the task. Examples of such
    higher levels are: Boost.Regex, Boost.Spirit, XML Schemas, etc.
    Moreover, the nature of parsers allows for a number of optimizations
    that are not possible in a general-purpose FSM framework.<br>
    Bottom line: While it is possible to implement a parser with this
    library, it is almost never advisable to do so, because other
    approaches lead to better-performing and more expressive code</li>
  <li>often run state machines in their own threads. This adds considerable
    locking and thread-switching overhead. Performance tests with the
    PingPong example, where two asynchronous state machines exchange
    events, gave the following times to process one event and perform the
    resulting in-state reaction (using the library with
    <code>boost::fast_pool_allocator&lt;&gt;</code>):
    <ul>
      <li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
      <li>Multi-threaded with one thread (the scheduler uses mutex locking
        but never has to wait for events): 6500ns / 4800ns</li>
      <li>Multi-threaded with two threads (both schedulers use mutex
        locking and exactly one always waits for an event): 14000ns /
        7000ns</li>
    </ul>
    As mentioned above, there definitely is some room to improve the
    timings for the asynchronous machines. Moreover, these are very crude
    benchmarks, designed to show the overhead of locking and thread context
    switching. The overhead in a real-world application will typically be
    smaller, and other operating systems can certainly do better in this
    area. However, I strongly believe that on most platforms the threading
    overhead is usually larger than the time that Boost.Statechart spends
    on event dispatch and transition. Handcrafted machines will inevitably
    have the same overhead, making raw single-threaded dispatch and
    transition speed much less important</li>
  <li>almost always allocate events with <code>new</code> and destroy them
    after consumption. This will add a few cycles, even if event memory
    management is customized</li>
  <li>often use state machines that employ orthogonal states and other
    advanced features. This forces the handcrafted machines to use a more
    adequate and more time-consuming book-keeping</li>
</ul>

<p>Therefore, in real-world applications event dispatch and transition
does not normally constitute a bottleneck, and the relative gap between
handcrafted and Boost.Statechart machines also becomes much smaller than in
the worst-case scenario.</p>
<h3>Detailed performance data</h3>
<p>In an effort to identify the main performance bottleneck, the example
program "Performance" has been written. It measures the time that is spent
to process one event in different BitMachine variants. In contrast to the
BitMachine example, which uses only transitions, Performance uses a varying
number of in-state reactions together with transitions. The only difference
between in-state reactions and transitions is that the former neither enter
nor exit any states. Apart from that, the same amount of code needs to be
run to dispatch an event and execute the resulting action.</p>
<p>The following diagrams show the average time the library spends to
process one event, depending on the percentage of in-state reactions
employed. 0% means that all triggered reactions are transitions. 100% means
that all triggered reactions are in-state reactions. I draw the following
conclusions from these measurements:</p>
<ul>
  <li>The fairly linear course of the curves suggests that the measurements
    with a 100% in-state reaction ratio are accurate and not merely a
    product of optimizations in the compiler. Such optimizations might have
    been possible due to the fact that in the 100% case it is known at
    compile time that the current state will never change</li>
  <li>The data points with 100% in-state reaction ratio and speed-optimized
    RTTI show that modern compilers seem to inline the complex-looking
    dispatch code so aggressively that dispatch is reduced to little more
    than what it actually is: one virtual function call followed by a
    linear search for a suitable reaction. For instance, in the case of the
    1-bit BitMachine, Intel9.0 seems to produce dispatch code that is as
    efficient as the two virtual function calls in the Handcrafted
    machine</li>
  <li>On all compilers and in all variants, the time spent in event
    dispatch is dwarfed by the time spent to exit the current state and
    enter the target state. It is worth noting that BitMachine is a flat
    and non-orthogonal state machine, representing a close-to-worst case.
    Real-world machines will often exit and enter multiple states during a
    transition, which further dwarfs pure dispatch time. This makes the
    implementation of constant-time dispatch (as requested by a few people
    during the formal review) an undertaking with little merit. Instead,
    future optimization effort will concentrate on state entry and state
    exit</li>
  <li>Intel9.0 seems to have problems optimizing/inlining code as soon as
    the amount of code grows over a certain threshold. Unlike with the
    other two compilers, I needed to compile the tests for the 1, 2, 3 and
    4-bit BitMachine into separate executables to get good performance.
    Even then, the performance of the 4-bit BitMachine was overly bad. It
    was much worse when I compiled all 4 tests into the same executable.
    This surely looks like a bug in the compiler</li>
</ul>
<h3>Out of the box</h3>

<p>The library is used as is, without any optimizations/modifications.</p>
<p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
width="371" height="284"> <img alt="PerformanceNormal2" src=
"PerformanceNormal2.gif" border="0" width="371" height="284"> <img alt=
"PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
height="284"> <img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
border="0" width="371" height="284"></p>
<h3>Native RTTI</h3>

<p>The library is used with <a href=
"configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a>
defined.</p>
<p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
width="371" height="284"> <img alt="PerformanceNative2" src=
"PerformanceNative2.gif" border="0" width="371" height="284"> <img alt=
"PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
height="284"> <img alt="PerformanceNative4" src="PerformanceNative4.gif"
border="0" width="371" height="284"></p>
<h3>Customized memory management</h3>

<p>The library is used with customized memory management
(<code>boost::fast_pool_allocator</code>).</p>
<p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
width="371" height="284"> <img alt="PerformanceCustom2" src=
"PerformanceCustom2.gif" border="0" width="371" height="284"> <img alt=
"PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
height="284"> <img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
border="0" width="371" height="284"></p>
<p>At the heart of every state machine lies an implementation of double
dispatch. This is due to the fact that the incoming event and the active
state define exactly which <a href=
"definitions.html#Reaction">reaction</a> the state machine will produce.
For each event dispatch, one virtual call is followed by a linear search
for the appropriate reaction, using one RTTI comparison per reaction. The
following alternatives were considered but rejected:</p>
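<p>The chosen scheme can be illustrated with a small stand-alone sketch.
The types below are my own stand-ins, not the library's internals; the
point is the cost model: one virtual call into the active state to obtain
its reaction table, then a linear search using one RTTI comparison per
visited reaction:</p>

<pre>
#include &lt;cassert&gt;
#include &lt;typeinfo&gt;
#include &lt;vector&gt;

struct Event { virtual ~Event() {} };
struct EvOn : Event {};
struct EvOff : Event {};

struct Reaction {
  const std::type_info* trigger;  // which event type this reaction handles
  int result;                     // stand-in for "execute the reaction"
};

struct State {
  virtual ~State() {}
  // The one virtual call: the active state exposes its reaction table.
  virtual const std::vector&lt;Reaction&gt;&amp; reactions() const = 0;

  int react(const Event&amp; ev) const {
    for (const Reaction&amp; r : reactions())   // linear search...
      if (*r.trigger == typeid(ev))         // ...one RTTI comparison each
        return r.result;
    return -1;                              // no reaction: discard the event
  }
};

struct Idle : State {
  const std::vector&lt;Reaction&gt;&amp; reactions() const override {
    static const std::vector&lt;Reaction&gt; table = {
      { &amp;typeid(EvOn), 1 }, { &amp;typeid(EvOff), 2 }
    };
    return table;
  }
};

int main() {
  Idle idle;
  assert(idle.react(EvOn()) == 1);    // found on the first comparison
  assert(idle.react(EvOff()) == 2);   // found on the second comparison
  return 0;
}
</pre>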
"http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
|
|
Packit |
58578d |
visitor: This double-dispatch variant satisfies all scalability
|
|
Packit |
58578d |
requirements but performs badly due to costly inheritance tree
|
|
Packit |
58578d |
cross-casts. Moreover, a state must store one v-pointer for each
|
|
Packit |
58578d |
reaction what slows down construction and makes memory management
|
|
Packit |
58578d |
customization inefficient. In addition, C++ RTTI must inevitably be
|
|
Packit |
58578d |
turned on, with negative effects on executable size. Boost.Statechart
|
|
Packit |
58578d |
originally employed acyclic visitor and was about 4 times slower than it
|
|
Packit |
58578d |
is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better
|
|
Packit |
58578d |
on other platforms but the other negative effects will remain
|
|
Packit |
58578d |
|
|
Packit |
58578d |
GOF
|
|
Packit |
58578d |
Visitor: The GOF Visitor pattern inevitably makes the whole machine
|
|
Packit |
58578d |
depend upon all events. That is, whenever a new event is added there is
|
|
Packit |
58578d |
no way around recompiling the whole state machine. This is contrary to
|
|
Packit |
58578d |
the scalability requirements
|
|
Packit |
58578d |
|
|
Packit |
58578d |
Single two-dimensional array of function pointers: To satisfy
|
|
Packit |
58578d |
requirement 6, it should be possible to spread a single state machine
|
|
Packit |
58578d |
over several translation units. This however means that the dispatch
|
|
Packit |
58578d |
table must be filled at runtime and the different translation units must
|
|
Packit |
58578d |
somehow make themselves "known", so that their part of the state machine
|
|
Packit |
58578d |
can be added to the table. There simply is no way to do this
|
|
Packit |
58578d |
automatically and portably. The only portable way that a state
|
|
Packit |
58578d |
machine distributed over several translation units could employ
|
|
Packit |
58578d |
table-based double dispatch relies on the user. The programmer(s) would
|
|
Packit |
58578d |
somehow have to manually tie together the various pieces of the
|
|
Packit |
58578d |
state machine. Not only does this scale badly but is also quite
|
|
Packit |
58578d |
error-prone
|
|
Packit |
58578d |
|
|
Packit |
58578d |
|
|
Packit |
58578d |
|
|
Packit |
58578d |
"MemoryManagementCustomization">Memory management customization
|
|
Packit |
58578d |
|
|
Packit |
58578d |
<p>Out of the box, everything (event objects, state objects, internal data,
etc.) is allocated through <code>std::allocator&lt; void &gt;</code> (the
default for the Allocator template parameter). This should be satisfactory
for applications meeting the following prerequisites:</p>
<ul>
  <li>There are no deterministic reaction time (hard real-time)
    requirements</li>
  <li>The application will never run long enough for heap fragmentation to
    become a problem. This is of course an issue for all long-running
    programs, not only the ones employing this library. However, it should
    be noted that fragmentation problems could show up earlier than with
    traditional FSM frameworks</li>
</ul>
<p>Should an application not meet these prerequisites, Boost.Statechart's
memory management can be customized as follows:</p>
<ul>
  <li>By passing a model of the standard Allocator concept to the class
    templates that support a corresponding parameter
    (<code>event&lt;&gt;</code>, <code>state_machine&lt;&gt;</code>,
    <code>asynchronous_state_machine&lt;&gt;</code>,
    <code>fifo_scheduler&lt;&gt;</code>, <code>fifo_worker&lt;&gt;</code>).
    This redirects all allocations to the passed custom allocator and
    should satisfy the needs of just about any project</li>
  <li>Additionally, it is possible to separately customize state memory
    management by overloading <code>operator new()</code> and
    <code>operator delete()</code> for all state classes, but this is
    probably only useful under rare circumstances</li>
</ul>
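<p>The second option can be sketched in plain C++. The fixed-size free-list
pool below is my own minimal stand-in for a real pool allocator, not
library code; the point is only that a state class with overloaded
<code>operator new()</code>/<code>operator delete()</code> bypasses the
global heap:</p>

<pre>
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;new&gt;

// Minimal free-list pool: released slots are reused before the heap is hit.
class StatePool {
public:
  static void* allocate() {
    if (!head_) return ::operator new(kSlot);  // pool empty: fall back once
    void* p = head_;
    head_ = *static_cast&lt;void**&gt;(head_);       // pop the free list
    return p;
  }
  static void release(void* p) {
    *static_cast&lt;void**&gt;(p) = head_;           // push onto the free list
    head_ = p;
  }
private:
  static const std::size_t kSlot = 64;  // big enough for the state below
  static void* head_;
};
void* StatePool::head_ = nullptr;

struct Running {                        // stand-in for a state class
  int counter = 0;
  static void* operator new(std::size_t) { return StatePool::allocate(); }
  static void operator delete(void* p) { StatePool::release(p); }
};

int main() {
  Running* a = new Running;             // heap once, pool afterwards
  void* first = a;
  delete a;                             // slot goes onto the free list
  Running* b = new Running;             // reuses the same slot
  assert(b == first);
  delete b;
  return 0;
}
</pre>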
<h2><a name="RttiCustomization">RTTI customization</a></h2>
<p>RTTI is used for event dispatch and
<code>state_downcast&lt;&gt;()</code>. Currently, there are exactly two
options:</p>
<ol>
  <li>By default, a speed-optimized internal implementation is
    employed</li>
  <li>The library can be instructed to use native C++ RTTI instead by
    defining <a href=
    "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></li>
</ol>
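<p>Since <code>BOOST_STATECHART_USE_NATIVE_RTTI</code> is an
application-defined configuration macro, a translation unit choosing option
2 would presumably define it before including any library header, along
these lines (a sketch only, not compiled here; see the linked configuration
page for the authoritative rules):</p>

<pre>
// Must be visible before any Boost.Statechart header is included, and
// defined consistently across all translation units of the state machine.
#define BOOST_STATECHART_USE_NATIVE_RTTI

#include &lt;boost/statechart/state_machine.hpp&gt;
#include &lt;boost/statechart/simple_state.hpp&gt;
</pre>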
<p>There are two reasons to favor option 2:</p>
<ul>
  <li>When a state machine (or parts of it) is compiled into a DLL,
    problems could arise from the use of the internal RTTI mechanism (see
    the FAQ item "How can I compile a state machine into a dynamic link
    library (DLL)?"). Using option 2 is one way to work around such
    problems (on some platforms, it seems to be the only way)</li>
  <li>State and event objects need to store one pointer less, meaning that
    in the best case the memory footprint of a state machine object could
    shrink by 15% (an empty event is typically 30% smaller, which can be an
    advantage when there are bursts of events rather than a steady flow).
    However, on most platforms executable size grows when C++ RTTI is
    turned on. So, given the small per-machine-object savings, this only
    makes sense in applications where both of the following conditions
    hold:
    <ul>
      <li>Event dispatch will never become a bottleneck</li>
      <li>There is a need to reduce the memory allocated at runtime (at the
        cost of a larger executable)</li>
    </ul>
    Obvious candidates are embedded systems where the executable resides in
    ROM. Other candidates are applications running a large number of
    identical state machines, where this measure could even reduce the
    overall memory footprint</li>
</ul>
<h2><a name="ResourceUsage">Resource usage</a></h2>

<h3>Memory</h3>
<p>On a 32-bit box, one empty active state typically needs less than 50
bytes of memory. Even very complex machines will usually have fewer than 20
simultaneously active states, so just about every machine should run with
less than one kilobyte of memory (not counting event queues). Obviously,
the per-machine memory footprint is offset by whatever state-local members
the user adds.</p>
<h3>Processor cycles</h3>
<p>The following ranking should give a rough picture of what feature will
consume how many cycles:</p>
<ol>
  <li><code>state_cast&lt;&gt;()</code>: By far the most cycle-consuming
    feature. Searches linearly for a suitable state, using one
    <code>dynamic_cast</code> per visited state</li>
  <li>State entry and exit: Profiling of the fully optimized
    1-bit-BitMachine suggested that roughly three quarters of the total
    event processing time is spent destructing the exited state and
    constructing the entered state. Obviously, transitions where the <a
    href="definitions.html#InnermostCommonContext">innermost common
    context</a> is "far" from the leaf states and/or with lots of
    orthogonal states can easily cause the destruction and construction of
    quite a few states, leading to significant amounts of time spent for a
    transition</li>
  <li><code>state_downcast&lt;&gt;()</code>: Searches linearly for the
    requested state, using one virtual call and one RTTI comparison per
    visited state</li>
  <li>Deep history: For all innermost states inside a state passing either
    <code>has_deep_history</code> or <code>has_full_history</code> to its
    state base class, a binary search through the (usually small) history
    map must be performed on each exit. History slot allocation is
    performed exactly once, at first exit</li>
  <li>Shallow history: For all direct inner states of a state passing
    either <code>has_shallow_history</code> or
    <code>has_full_history</code> to its state base class, a binary search
    through the (usually small) history map must be performed on each exit.
    History slot allocation is performed exactly once, at first exit</li>
  <li>Event dispatch: One virtual call followed by a linear search for a
    suitable reaction, using one RTTI comparison per visited reaction</li>
  <li>Orthogonal states: One additional virtual call for each exited state,
    if there is more than one active leaf state before a transition. It
    should also be noted that the worst-case event dispatch time is
    multiplied in the presence of orthogonal states. For example, if two
    orthogonal leaf states are added to a given state configuration, the
    worst-case time is tripled</li>
</ol>
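<p>The top-ranked cost, <code>state_cast&lt;&gt;()</code>, can be made
concrete with a small sketch (my own types, not the library's): the search
walks the currently active states and tries one <code>dynamic_cast</code>
per visited state until one succeeds:</p>

<pre>
#include &lt;cassert&gt;

struct StateBase { virtual ~StateBase() {} };
struct Walking : StateBase {};
struct Running : StateBase {};

// Linear search over the active states; each probe is a dynamic_cast,
// which is what makes this the most cycle-consuming feature.
template &lt;typename Target&gt;
const Target* state_cast_sketch(const StateBase* const* first,
                                const StateBase* const* last) {
  for (; first != last; ++first)
    if (const Target* t = dynamic_cast&lt;const Target*&gt;(*first))
      return t;                     // found a state of the requested type
  return nullptr;                   // no active state matches
}

int main() {
  Walking w;
  Running r;
  const StateBase* active[] = { &amp;w, &amp;r };  // currently active states
  assert(state_cast_sketch&lt;Running&gt;(active, active + 2) == &amp;r);
  assert(state_cast_sketch&lt;Walking&gt;(active, active + 2) == &amp;w);
  return 0;
}
</pre>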
<hr>

<p><img border="0" src="../../../doc/images/valid-html401.png" alt=
"Valid HTML 4.01 Transitional" height="31" width="88"></p>
<p>Revised 03 December, 2006</p>

<p>Copyright &copy; 2003-2006 Andreas Huber D&ouml;nni</p>

<p>Distributed under the Boost Software License, Version 1.0. (See
accompanying file LICENSE_1_0.txt or copy at <a href=
"http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</p>

</body>

</html>