Pipelined processors do not like conditional branches: when a conditional branch enters the pipeline, the processor does not yet know the condition outcome, but it must keep filling the pipeline with new instructions. In that situation, the processor predicts the outcome and starts filling the pipeline based on its prediction. This is very costly performance-wise when the processor mispredicts: it has to flush the pipeline content and start filling it again from the branch point.
The conditional branch prediction module has two components: a branch outcome cache that helps the processor remember the outcome of branches it has encountered recently, and a static branch prediction algorithm used when a branch has no recorded history.
The static branch prediction algorithm is very simple: forward conditional branches are predicted not taken, and backward conditional branches are predicted taken (the backward rule matches loop branches, which are taken on every iteration except the last).
If you read the Intel processor optimization guide, a few simple guidelines are easy to apply:
1. Eliminate branches
Some conditional branches are simply useless: whether they are taken or not makes no difference to the end result. Two examples are:
if( i != 0 ) i = 0;
or
if( ptr != NULL ) { delete ptr; ptr = NULL; }
should be changed to:
i = 0;
and
delete ptr;
ptr = NULL;
It is faster to unconditionally reassign the same value to a variable than to assign it conditionally. The operator delete and the function free() already contain safeguards against NULL pointers, and it should be quite exceptional that they are called with NULL pointers anyway.
2. Refactor if statement conditions to be true most of the time.
Here is an example:
if( error )
{
    // exceptional error handling
}
else
{
    // normal operation
}
would be faster written like this:
if( !error )
{
    // normal operation
}
else
{
    // exceptional error handling
}
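If you would rather state the expected outcome explicitly than rely on code layout, some compilers offer hints for it. Here is a minimal sketch using GCC's __builtin_expect extension; the LIKELY/UNLIKELY macro names and the process() function are my own, for illustration only:

// GCC/Clang extension: tells the compiler which outcome is expected,
// so it can lay out the likely path as the fall-through path.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process( int error )
{
    if( UNLIKELY(error) )
    {
        return -1; // exceptional error handling
    }
    return 0; // normal operation
}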
To conclude, these guidelines are specifically for Intel processors, but since the static branch prediction algorithm used by Intel processors is widely used by other pipelined processors (PowerPC is one of them), applying them on other platforms should be beneficial as well. Also, if you are looking for other optimization techniques, you can consult these sources.
I have done pthread programming for a while and read a couple of books on the topic too, but nothing had prepared me for the undefined behavior that I discovered the hard way.
The POSIX standard states that when the main thread exits the process, the remaining threads are automatically terminated. What the standard does not specify is the order in which the different process cleanup steps occur; hence, if you leave the process without terminating the threads yourself, you get undefined behavior. In my case, it was a segmentation fault.
The undefined behavior is real, especially in a C++ program, because part of the termination cleanup is the destruction of the program's global objects. If your threads are still running and happen to use global objects, this is a sure way to a segmentation fault.
Even in fixing the problem, I learned something new about pthread programming. To fix my problem, I just sent myself the SIGINT signal, which initiates the correct termination. In my first attempt, I replaced my exit() function call with raise(SIGINT), which, as I had read in my UNIX programming textbook, was equivalent to kill(getpid(), SIGINT); but by doing so, the signal was never received. After some investigation, I found in a book that in a multithreaded program, raise(SIGINT) is instead equivalent to pthread_kill(pthread_self(), SIGINT). In my case, calling raise() was doing nothing because the signal was blocked in the thread from where the call was made. Everything finally worked fine when I replaced the raise() call with kill(getpid(), SIGINT).
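Here is a minimal sketch of the fix, under the assumption that a SIGINT handler performs the orderly shutdown; the worker() function and the handler body are placeholders, not the code from my program:

#include <signal.h>
#include <unistd.h>
#include <pthread.h>

extern "C" void onSigint( int )
{
    // A real program would tell its worker threads to stop and join
    // them here, before the process-wide cleanup runs.
    _exit( 0 );
}

extern "C" void *worker( void * )
{
    for( ;; ) pause(); // placeholder for real work
    return NULL;
}

int main()
{
    signal( SIGINT, onSigint );

    pthread_t t;
    pthread_create( &t, NULL, worker, NULL );

    // raise(SIGINT) is equivalent to pthread_kill(pthread_self(), SIGINT):
    // if SIGINT is blocked in the calling thread, it is never delivered.
    // kill() targets the whole process, so any thread that has SIGINT
    // unblocked can receive it.
    kill( getpid(), SIGINT );

    pause(); // wait for the handler to terminate the process
    return 0;
}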
This is advice that can be found in many books, such as:
C++ Coding Standards, Exceptional C++, More Effective C++ and The C++ Standard Library.
However, despite the abundance of literature giving this advice and explaining why, a lot of people are not aware of it. So if you are one of those persons, someone who spends more time browsing the web than reading books, and you have found this blog, this is for you.
At first look, statements like

++i;
i++;

seem equivalent, but you have to remember that if i is an object, the increment operator is implemented with a function. Seeing the typical implementation of the operator functions should explain everything:
// Prefix version
T& T::operator++()
{
    // Perform increment
    return *this; // Return a reference to this
}

// Postfix version
T T::operator++(int)
{
    T res(*this); // Make a copy of the current value
    ++*this;      // Call the prefix version
    return res;   // Return the copy by value
}
So, as you can see, most of the time the postfix increment operator is implemented in terms of the prefix one, so even if you call the postfix operator, you end up calling the prefix operator function anyway. The rest is just overhead:
1. The copy construction of the temporary res
2. A function call if operator++() is not inline
3. The copy that might be made to return res by value
You'll notice that at the third point, I used 'might'. If you have a good compiler, it might be able to get rid of the construction of the return value by using a technique called Return Value Optimization (RVO). Most compilers are able to pull off that optimization with an unnamed object (a temporary object) like:
return T();
If a return statement returns an automatic variable (a local variable) declared before the return statement, as is done in the operator++(int) function, a lot fewer compilers are able to apply the RVO. Those that are able are said to have named RVO (NRVO). Visual C++ has it only since version 2005 according to this blog entry. GCC supports it since version 3.1 according to this Boost mailing list archive entry.
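You can observe whether your compiler applies these optimizations with a small test of my own (not from the books above): give a class a copy constructor that prints, and count the copies.

#include <cstdio>

struct T
{
    T() {}
    T( const T & ) { std::puts( "copy" ); } // prints on every copy made
};

T makeUnnamed()
{
    return T(); // RVO candidate: most compilers elide this copy
}

T makeNamed()
{
    T res;
    return res; // NRVO candidate: fewer compilers elide this copy
}

int main()
{
    T a = makeUnnamed(); // usually prints nothing
    T b = makeNamed();   // prints "copy" on a compiler without NRVO
    (void)a; (void)b;
    return 0;
}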
To conclude, it should now be obvious that unless you absolutely need the postfix increment semantics, you should use the prefix version. This should represent 99% of the increments in a program since, usually, they stand alone in their own statement and the return value is not used at all. This is usually what happens in code using STL container iterators. One could argue that it does not make a perceptible difference performance-wise in a program. Maybe, but why choose the slower option when both ways are equally easy to write? Herb Sutter and Andrei Alexandrescu think the same, and they give this excellent advice (don't pessimize prematurely) in their book C++ Coding Standards.
Before clearing or destroying an STL container of dynamically allocated pointers, it is important to delete the pointers, or else you are going to leak memory and resources (see item 7 in Effective STL). The usual way to do this is:
for( vector<MyClass*>::iterator it = v.begin(), e = v.end(); it != e; ++it )
{
    delete *it;
}
v.clear();
There is a small twist with sets. This code would be fine with a set using its default compare function, but since sorting pointers by their values is not very useful, most of the time such containers are configured with a custom compare function that dereferences the pointers to perform its comparison. The standard does not specify which set methods use the compare function and which do not, so freeing the memory while keeping the dangling pointers in the set is looking for trouble: at best, your set will be in an invalid state.
The easy way is to store boost::shared_ptr<MyClass> objects in the set. All you have to do is clear the set, and the smart pointers will take care of deallocating the memory. My opinion, however, is that this solution is like using a machine gun to kill a mosquito. The real purpose of smart pointers is to deal with situations where pointer ownership is fuzzy. If you know without any doubt that it is safe to delete the pointers in your container when you want to clear it, you should not be using smart pointers.
Another option is to use boost::ptr_set, but since I am not very familiar with this class, this is not something I will recommend.
The best solution that I have seen is simply doing this:
typedef std::set<MyClass*, CompFuncThatDereferencePointers> MySet;

for( MySet::iterator it = s.begin(), e = s.end(); it != e; )
{
    MyClass *p = *it;
    s.erase(it++);
    delete p;
}
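For completeness, here is what the custom compare function could look like. This is a sketch: the key() accessor is a hypothetical member of MyClass, standing in for whatever field you actually sort on.

struct CompFuncThatDereferencePointers
{
    bool operator()( const MyClass *lhs, const MyClass *rhs ) const
    {
        // Compare the pointed-to objects, not the pointer values.
        return lhs->key() < rhs->key(); // key() is a hypothetical accessor
    }
};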
There are two more details that I want to discuss. For some STL containers (such as set and map), the erase() method signature is

void erase(iterator it);

while some other containers (list and vector) have

iterator erase(iterator it);
The second signature makes sense for containers that invalidate all their iterators upon insertion or removal of elements, but I have no idea why list has its erase() returning an iterator. If you think that you might switch from a list to an associative container, you might want to erase elements from the list with

l.erase(iter++);

to keep your code compatible with the associative containers' interface. As I am writing this, I think the STL designers decided to make the list erase() signature compatible with the vector erase() because they considered that this was the most likely switch if you had to change your container strategy.
The other important point that I wanted to mention is that my recommendation to explicitly delete the pointers rather than use boost::shared_ptr assumes that your container is a data member of a class and that you are using the RAII idiom; that is, you are deleting the container elements' memory in the class destructor. This is to be sure that your code is exception-safe. If your container is an automatic variable in a function, then this might be an appropriate situation to use boost::shared_ptr, because if you do not and an exception is thrown before your container-clearing code is executed, you will leak memory.
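To make the recommendation concrete, here is a minimal sketch of that RAII arrangement, reusing the MySet typedef and the comparator sketched above; the class name is mine, for illustration only:

class MyClassSetOwner
{
public:
    MyClassSetOwner() {}

    // The destructor frees the elements, so the memory is reclaimed
    // even if an exception unwinds the stack.
    ~MyClassSetOwner() { clear(); }

    void clear()
    {
        for( MySet::iterator it = set_.begin(), e = set_.end(); it != e; )
        {
            MyClass *p = *it;
            set_.erase(it++);
            delete p;
        }
    }

private:
    // Copying is disabled because the class owns raw pointers.
    MyClassSetOwner( const MyClassSetOwner & );
    MyClassSetOwner &operator=( const MyClassSetOwner & );

    MySet set_;
};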
In the book The C++ Programming Language by Bjarne Stroustrup, you can read in section 5.7 (Structures):
The size of an object of a structure type is not necessarily the sum of the sizes of its members. This is because many machines require objects of certain types to be allocated on architecture-dependent boundaries or handle such objects much more efficiently if they are. For example, integers are often allocated on word boundaries. On such machines, objects are said to have to be aligned properly. This leads to "holes" in the structures... You can minimize wasted space by simply ordering members by size (largest member first).
As an example, on my platform if I write:
struct MyPOD1
{
    bool a;
    int b;
    bool c;
    int d;
};

struct MyPOD2
{
    int b;
    int d;
    bool a;
    bool c;
};
sizeof(MyPOD1) is 16 and sizeof(MyPOD2) is 12, which means that 4 bytes are wasted by the MyPOD1 layout. Bjarne continues by adding that you should prefer readability over memory usage optimization, but in my opinion, taking the habit of always placing 1-byte variables such as bools and chars at the end of a class's data member declarations can hardly obfuscate the code, especially if the class encapsulation is tight.
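You can check the sizes on your own platform with a couple of lines; the exact numbers depend on the compiler and the architecture, and mine assume a 4-byte int:

#include <cstdio>

// The structures from above, repeated so the example stands alone.
struct MyPOD1 { bool a; int b; bool c; int d; };
struct MyPOD2 { int b; int d; bool a; bool c; };

int main()
{
    std::printf( "MyPOD1: %u bytes\n", (unsigned)sizeof(MyPOD1) );
    std::printf( "MyPOD2: %u bytes\n", (unsigned)sizeof(MyPOD2) );
    return 0;
}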
Now that I have demonstrated that just reordering the data members in a class declaration can save space, let me explain the advantage of saving a few bytes. Just imagine that you have a std::vector<MyPOD2> of 25K elements. You just saved 100KB of space in your program for free, and since more elements will fit in the processor cache, the vector traversal should be faster.
I recommend adopting this habit because it is easy to do; once you get used to it, it will become automatic and you will get the best performance from your data structures without having to stop and think about how to optimize them before you need to.
In this blog, I want you to find information about C++ programming that I had a hard time finding on the web in the first place.