Incremental improvements

Posted by Thiago Macieira on December 12, 2009 · 20 comments

I don’t know if this is showing up for the community, but we in Qt have been dedicating a lot of effort for performance improvements in Qt. For Qt 4.5, we had a project codenamed “Falcon” whose job was to improve the graphics engines and make them perform much faster. From that project, we got the graphics engines, including the raster and OpenGL ones.

For Qt 4.6, there was a lot of work done on Graphics View. For Qt 4.7, we’re going to do some more. Where, I don’t know yet.

Among the many ideas, one I’m interested in seems very small, but may be of important benefit: removing volatile from QAtomicInt and QAtomicPointer.

Here’s what happens: QAtomicInt derives from the internal class QBasicAtomicInt, which is a struct of one member: a volatile int _q_value. Similarly, QAtomicPointer derives from QBasicAtomicPointer, which is a struct of one member: T * volatile _q_value. The idea here is to remove those two “volatile” keywords.

Before you cry foul and tell me that I’m going to break your code, let me quote the Qt documentation for these two classes:

For convenience, QAtomicPointer provides pointer comparison, cast, dereference, and assignment operators. Note that these operators are not atomic.

(emphasis is in the documentation)

With that card up my sleeve, I claim that I’m not breaking any contracts. All of the atomic operations that these two classes support (fetch-and-add, test-and-set, fetch-and-store) are implemented in assembly, which means the compiler cannot optimise them anyway. And the assembly code will not be influenced by any caching of the variable contents that the compiler may want to produce. What’s more, we also tell the compiler that we changed the value, so that it will discard its cache.

So, why are we considering this?

Well, the reason I hinted above: the compiler caching the value. The whole point of the volatile keyword is that the compiler may not cache the value. It must reload the value every time it tries to access the variable.

And if we look at any of the Qt tool classes, the non-const functions start by calling detach(), which is generally implemented like this (extracted from qlist.h):

inline void detach() { if (d->ref != 1) detach_helper(); }

That is, “if our reference count is not one (i.e., if we’re being shared), do the detaching.”

And since QAtomicInt::operator int() simply returns _q_value, which is volatile, the compiler has to reload the value every single time. Then it must actually compare that value to 1 and generate the proper branching.

What the compiler doesn’t know, is that once a container detaches, the reference count will remain 1. It can only increase from 1 in a way that is visible to the compiler: that is, either in the same thread or, if the container is a globally-visible variable, after mutex locking/unlocking.

So, if we remove the volatile keyword, the compiler is allowed to cache the value of the reference count. Once the first detaching happens, the compiler knows that the reference count is 1. It can therefore optimise out all the next checks, because it also knows that the reference count remains 1.

This would mean that our reference counting system would be a lot more efficient (hence the title of this blog). It might turn out to be the best ratio of performance gain vs effort ever. After all, it’s a one-word change in one header file.

That’s the theory anyway. I haven’t yet tested to see if the compiler really knows how to optimise this the way I expect it to.

PS: credit where credit is due: this optimisation was not my idea. It was Olivier’s. And at first I resisted, saying it would break stuff and wouldn’t work. But now I’m in favour of it. :-)

QShare(this)

No related posts.


20 comments

1 Benjamin December 12, 2009 at 1:31 am
 

Actually, the documentation of QAtomicInt says : “For convenience, QAtomicInt provides integer comparison, cast, and assignment operators. Note that a combination of these operators is not an atomic operation.”

This document has changed recently. I doubt Olivier’s change will break any code. I am all for this change as well if it gives real benefit.

2 Thiago Macieira December 12, 2009 at 11:52 am
 

Benjamin: yes, you’re right. In fact, those operators are atomic and we even rely on that fact.

This will not change: they will remain atomic.

What is not specified is the memory ordering. Those operators all use the “relaxed” memory ordering.

3 AlekSi December 12, 2009 at 12:03 pm
 

Thiago, I think you should update your Gravatar email address. ;)

4 mkretz December 12, 2009 at 12:15 pm
 

FWIW, I’ve seen gcc happily optimizing code like the following

if(a) foo();
if(a) bar();

to

if(a) { foo(); bar(); }

i.e. it reduces to just one branch. Now this probably isn’t applicable in many places for you unless detach() is called close enough together that gcc decides this optimization as beneficial.

Another thought: the compiler can only make a static optimization if it can prove that no other store overwrites the ref counter. As far as I understand the aliasing rules, the following is ok:

float *x = …
y->detach();
x[0] = 1.f;
y->detach();

because float cannot alias int memory. But if x is of type “int *” the compiler can’t be sure that it doesn’t alias the ref counter. Perhaps you can further improve the optimization ability of the compiler by making the ref counter a unique type, like e.g.
struct IntRefCount {
int d;
};

5 Thiago Macieira December 12, 2009 at 3:45 pm
 

@AlekSi: the email address in the wordpress installation at labs is up-to-date.

@mkretz: the point is that we want the compiler to optimise like that, provided it’s safe to do so. In any case, the refcounter is either a QAtomicInt or a QBasicAtomicInt.

6 mkretz December 12, 2009 at 7:14 pm
 

Ah right. The Q(Basic)AtomicInt is the new type. Now, there my thinking was confused… Because I was thinking of the member of the Q(Basic)AtomicInt which then would be an int. (Of course that’s the same as my IntRefCount :S). Anyway, so I’m just back to “aliasing might make the optimization work less well then expected”, without a good idea how to test it or even work around it.

Try, test and benchmark I’d say. :)

7 DavidB December 12, 2009 at 7:50 pm
 

Away from the GUI I think QSockets needs to be reworked. You get nasty warnings if you try to write to a QSocket in thread that did not create it. In today’s world of many cores and many threads I consider this to be a design flaw of QSockets. I’d also like to see a QUDPServer class similiar to QTcpServer.

QSQL could use a little loving as well. It would be nice if Qt4′s designer could reach parity with Qt3′s designer that allowed you to drop and view SQL in table viewers.

With respect to the GUI, I’d like to see more default controls for maximizing the size of a Toplevel widget or Dialog. I mean if you have a one line Dialog do you really want to allow it to fill the whole screen. This kind of beha
vior is the norm rather then the exception for Linux. If I recall right, setting a maximum size on a toplevel window results in the “Show Maximum Size” window decoration not being displayed. This is annoying as a user who has shrunken a window should be able to press this icon to get it back to its maximum size (not fill the whole screen)

Window attributes are also in a messy state and need to be cleaned up. I’d like to easily be able to have no title bar and yet have move and resize handles on a top level window.

I’d like to see your form layout have more then one column as well.

I’d like to see QStylSheet and QPalette converge into a more consistent or uniform API. I think QPalette needs some attention as well in that one always has to scratch their head on how to set colors (Colors and Brushes).

Nothing grandiose here in my suggestions. I’ve been a Qt developer for a long time and really appreciate what a great job you guys have done. And while I usually like big new feature editions to Qt, I also very much like the small subtle improvements as well.

Regards
David

8 Peter December 13, 2009 at 1:52 am
 

While discussing this change theoretically is interesting, real benchmark numbers would be very helpfully, maybe it is not worth the effort. I’m looking forward to the first numbers.

9 Peter December 13, 2009 at 10:34 am
 

Another optimization idea is to avoid heap allocations and to use more intensively the stack.
In essence the code looks like this:

char bytes[sizeof(T)]; // allocate memory on stack
T* t = new ((T*) bytes) T(); // in-place new: don’t allocate on heap, call constructor

QVarLengthArray uses this trick.
But I assume it could also be used at other places.

Have you discussed this already? What was the outcome?

10 Philippe December 13, 2009 at 2:03 pm
 

Benchmark numbers would certainly be interesting, but as Thiago says, this is about “Incremental improvements”. And optimizing code that is executed all the time by all parts of Qt, is always a good thing to do. “Little brooks make great rivers”.

11 Thiago Macieira December 14, 2009 at 1:14 am
 

@DavidB: this is not really the forum for discussing features, so I’m going to restrict myself to addressing what I am in the mood for addressing :-)

Please post all your other suggestions to bugreports.qt.nokia.com.

Anyway, QTcpSocket is designed for a single thread, because it does a lot of buffering. Remember that it reacts to there being data in the kernel buffers and reads. So if we allowed you to access it from separate threads, we’d need to introduce locks. And that’s a performance penalty that we don’t want to pay because most people do it right: they use QTcpSocket in the thread that they are created.

In this world of many cores and threads, using multiple threads for networking is a waste of resources. One single thread can handle a lot more data than the fastest network connection you are likely to run into. Instead of doing what you suggest, we’ll just try to make the I/O subsystem be even more performant in terms of throughput.

And as for UDP, if you’re asking for QUdpServer, you don’t really understand UDP. QUdpSocket is already a server.

12 Thiago Macieira December 14, 2009 at 1:17 am
 

@Peter: the problem with that approach of yours is alignment. QVarLengthArray forces a high alignment constraint by putting tha data in union with a “double”. But that’s not enough for some multimedia types.

And, of course, it doesn’t help in shared and/or dynamic containers. It’s a valid suggestion for function-local buffers and lists, but it’s very hard to do properly in C++.

Remember that the following construct is valid C99 but invalid C++:

void function(int size)
{
    char buf[size];
}
13 wysota December 14, 2009 at 1:20 am
 

@DavidB: The warnings are nasty because you are trying to do a nasty thing :) It is because of QObject, not Q*Socket class. The class is not thread-safe so accessing it from random threads will lead to trouble. If all you want is to be able to access the socket from a particular thread you can use QObject::moveToThread() to push the object to the thread you want to handle it from now on. Remember the target thread has to run an event loop, otherwise network events won’t be delivered to the object.

14 DavidB December 14, 2009 at 3:39 am
 

My interest in QTCP Sockets and multiple threads might be a bit self centered. In my server project I have 16 threads running on 16 cores, and each thread number crunching on a large set of data. Away from the data I have 150 users connected via a socket to my server. Each wanting different subsets of data. In my perfect world I ‘d like to be able to send data on the socket from a given thread. Sure I have to lock the socket while it was being written but isn’t that a lot faster then sending a signal to the thread where the socket was created. I guess I could do a move parent but was not sure of the overhead.

Maybe Qt isn’t the best choice for this kind of environment. I had some good reasons for starting it with QT – 1) Qt serialises the data very well and takes care of Little/Big Endian. 2) Qt has adequate hooks for talking to a MySQL database, 3) Qt ‘s QTCPServer does do lot of what I need. Of course there’s nothing that prevents me from using Qt for everything but the sockets, and this might be the best way to go. I welcome any thoughts on this.

-David

15 Philippe December 14, 2009 at 2:07 pm
 

@ Thiago: “double alignment” is more than excellent for performances (ie. if you are using aligned double). So what do you mean with “that’s not enough for some multimedia types.” ?

16 Peter December 14, 2009 at 8:01 pm
 

@Thiago: My idea was to apply this technique to stable Qt classes where the
private d-pointers always require a heap allocation. When the class is stable it
is possible to move the declaration of the private data structure to the public
header, like in QString. Then it’s possible to remove e.g. the qMalloc call
in QString::fromLatin1_helper:

QString::Data *QString::fromLatin1_helper(const char *str, int size) {
/* this is the current code */
d = static_cast(qMalloc(sizeof(Data) + size * sizeof(QChar)));

Alignment could be calculated by the compiler, using a templates:

template struct Align {
enum { bytesNeeded = /* do the template magic*/ };
};

The code would look like this:

struct FooPrivate { /* some data */ };

struct FooPrivateAligned {
FooPrivate d;
char alignmentBytes[Align::bytesNeeded];
};

class Foo {
FooPrivate* d;
char aligned[sizeof(FooPrivateAligned)];
Foo() : d(aligned.d) {}
};

When I find the time I try it out with QString.
Are there somewhere benchmarks for QString?

17 Thiago Macieira December 14, 2009 at 10:02 pm
 

@DavidB: only the networking needs to be done in one thread. You can do the number crunching in any thread.

That’s a classic Producer-Consumer threading example, straight out of the textbooks.

18 Thiago Macieira December 14, 2009 at 10:03 pm
 

@Phillipe: some types require alignment to more than what double requires. Some vector types in multimedia require 128-bit or more.

19 Thiago Macieira December 14, 2009 at 10:08 pm
 

@Peter: no, that’s not possible.

Imagine:

QString globalString;

void foo()
{
    QString allocatedFullyOnStack("data");
    globalString = allocatedFullyOnStack;
}

If the allocatedFullyOnStack is, like its name says, allocated fully on the stack, then it can’t be shared with the globalString variable. Otherwise, when foo() returns, globalString will be pointing to garbage.

20 Peter December 15, 2009 at 11:40 pm
 

I see. Seems I was only fascinated by the placement new ;)

Comments on this entry are closed.

Previous post:

Next post: