Qt Graphics and Performance – The Raster Engine

Posted by Gunnar Sletta on December 18, 2009 · 19 comments

Today’s topic is the raster engine, Qt’s software rasterizer. It’s the reference implementation and the only paint engine that implements all possible feature combinations that QPainter offers.

History

The story of Qt’s software engine started around December 2004, if my memory serves me. My colleague Trond and I had been working for a while on the new painting architecture for Qt 4, codenamed “Arthur”. Trond had been working on the X11 and OpenGL 1.x engines and I was focusing on the combined Win32 GDI/GDI+ engine along with QPainter and surrounding APIs. We had introduced a few new features, such as antialiasing, alpha transparency for QColor, full world transformation support and linear gradients. As few of these new features were supported by GDI, it meant that using any of these features implied switching to GDI+, which at the time was insanely slow, at least on all the machines we had in the Oslo office back then. Actually, enabling the GDI advanced graphics mode to do transformations was also not very fast.

Then we came upon this toolkit called Anti-Grain Geometry (AGG) which did everything in software, in plain C++, and we were just amazed at what it could do. Our immediate reaction was to curl up on the floor in agony, thinking that we were going about this all wrong. Using these native APIs was not helping us at all. In fact, it was preventing us from getting the feature set we wanted with a performance that was acceptable. Once we settled down again, our first idea was to try to implement a custom AGG paint engine which would just delegate all drawing into the AGG pipeline. But alas, the template nature of the AGG API combined with the extremely generic QPainter API bloated up into a pipeline that didn’t perform nearly as well as the demos we had seen.

So we took our Christmas vacation and started over in January of 2005. Still quite depressed over the new feature set that didn’t perform, combined with being limited by a minimal subset of native APIs, I went to Matthias and Lars and asked if I could get three weeks of time to hack together a software-only paint engine as a proof of concept. I got an “OK” and spent the following weeks implementing software pixmap transformation, bi-linear filtering and clipping support in the crudest possible way. Three weeks later I had a running software paint engine and quite proudly announced that I was “just about done”. I’ve reconstructed an image of how I remember it:

[Image: groupboxes]

The system clipping was all over the place, bitmap patterns were broken, but perhaps worst of all, all text was rendered using QPainterPaths, and all drawing was antialiased. Despite it not looking 100% good, the performance of the various features was pretty OK. It was agreed that this was a good start, but that we needed a bit more work. And so started the sprint for the Qt 4.0 beta a few months later.

The initial version that was released with Qt 4.0 worked quite well in terms of features, but in hindsight the performance was far from what our users demanded from Qt. As a result, we harvested a lot of criticism over the first year of Qt 4.0. Since then, we’ve done a lot, and I mean a LOT, and my gut feeling is that it is the engine that performs the best for average Qt usage, so I think we made a good choice back then in dropping GDI and GDI+. And, as I outlined in my previous post, we are toying with making raster the default across all desktop systems for the sake of speed and consistency.

Overall structure

The overall structure of the engine is that all drawing is decomposed into horizontal bands with a coverage value, called spans. Many spans will together form the “mask” for a shape and each pixel that is inside the mask is filled using a span function.

[Image: antialiasing]

The image highlights one scanline of a polygon which is filled with a linear gradient. There are 4 spans: one which fades in the opacity of the polygon, one which covers the fully opaque interior, and two which fade out the opacity of the gradient. For each pixel in the polygon, the gradient function is called and we write the pixel to the destination, possibly alpha blending it if the coverage value is other than full opacity or if the pixel we got from the gradient function contains alpha.
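To make the span idea concrete, here is a minimal sketch in plain C++. It is not the engine's actual code: the `Span` struct, `mix` and `fillSolid` are hypothetical names, and the destination is simplified to an 8-bit grayscale buffer so the coverage blending is easy to follow.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical span record: a horizontal run of pixels sharing one coverage value.
struct Span {
    int x;                  // first pixel of the run
    int y;                  // scanline
    int len;                // number of pixels in the run
    std::uint8_t coverage;  // 0 = outside the shape, 255 = fully inside
};

// Blend one 8-bit channel: (src * coverage + dst * (255 - coverage)) / 255, rounded.
inline std::uint8_t mix(std::uint8_t dst, std::uint8_t src, std::uint8_t coverage) {
    return static_cast<std::uint8_t>((src * coverage + dst * (255 - coverage) + 127) / 255);
}

// A "span function" for a solid fill on an 8-bit grayscale buffer.
// `width` is the number of pixels per scanline in `buffer`.
void fillSolid(std::vector<std::uint8_t> &buffer, int width,
               const std::vector<Span> &spans, std::uint8_t color) {
    for (const Span &s : spans) {
        std::uint8_t *row = buffer.data() + s.y * width + s.x;
        for (int i = 0; i < s.len; ++i)
            row[i] = mix(row[i], color, s.coverage);
    }
}
```

A scanline like the one in the image would be described by a partial-coverage span on the left edge, a full-coverage span for the interior and partial-coverage spans on the right edge, all handed to the same fill function.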

Clipping also uses the same mechanism. The span function for clipping takes the incoming spans, intersects them with the set of spans that defines the clip and calls the actual filling span function.

[Image: clipspans]
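The intersection step can be sketched roughly as follows. This is an illustration only, with hypothetical names (`Run`, `clipRuns`); it handles a single scanline, ignores coverage values, and assumes both lists are sorted by x and non-overlapping. In the engine, the resulting runs would then be passed on to the filling span function.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical run on one scanline: pixels [x, x + len).
struct Run {
    int x;
    int len;
};

// Intersect the incoming runs with the clip runs of the same scanline.
// Both inputs are assumed to be sorted by x and free of overlaps.
std::vector<Run> clipRuns(const std::vector<Run> &in, const std::vector<Run> &clip) {
    std::vector<Run> out;
    std::size_t i = 0, c = 0;
    while (i < in.size() && c < clip.size()) {
        const int inEnd = in[i].x + in[i].len;
        const int clipEnd = clip[c].x + clip[c].len;
        const int left = std::max(in[i].x, clip[c].x);
        const int right = std::min(inEnd, clipEnd);
        if (left < right)
            out.push_back({left, right - left});
        // Advance whichever run ends first; the other may still overlap the next one.
        if (inEnd < clipEnd)
            ++i;
        else
            ++c;
    }
    return out;
}
```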

All operations follow this pattern. When a drawRect call comes in, we generate a list of spans for each scan line and set up a span function according to the current brush. A pixmap is similar: we create a list of spans and use a pixmap span function. A polygon is passed to a scan converter which produces a span list, etc. We have two scan converters, one for antialiased and one for aliased drawing. The antialiased one is pretty much a fork of FreeType’s grayraster.c, with some minor tweaks; I think we needed to add support for odd-even fills, for instance. Text is also converted into spans.

Lines, Polylines and Path Strokes

These primitives are passed to a separate processor called a stroker. The stroker creates a new path that visually matches the fillable shape that the outline represents. There is a public API for this too, in QPainterPathStroker. This fillable shape is then passed to one of the scan converters, which in turn scan converts the shape into spans. For dashed outlines, the same process happens, and the resulting fillable shape is a path with a potentially very large number of subpaths. Naturally, such a path is costly to scan convert, which is part of the reason why we explicitly do not put dashed lines on the list of high-performance features. In fact, in many cases, line dashing is one of the slowest operations available in the raster engine, so use it with extreme caution.

A hacky alternative which performs much better is to set a 2×2 black/white or black/transparent pixmap brush and draw the stroke using a pen with that brush. A bit more to set up, but if that’s what it takes to get it running fast, then that’s what it takes.

State changes

Any setBrush, setTransform or any other state change on QPainter will result in a different set of span functions being set up. Each brush (or fill type, if you like, as pens on this level are essentially just fills too) has a special span function associated with it, and we also pass per-brush span data. For solid color fills the span data contains the color; for transformed pixmap drawing it contains the inverse matrix, a source pixel pointer, bytes per line and other required information. For clips it contains the span function to call after the spans have been clipped. The thing to notice about state changes is that each time you switch from one brush to another brush or from one transformation to another, these structures do need to be updated. Up to Qt 4.4, this was in many cases a noticeable performance problem, bubbling up to 10-15% in profilers when rendering graphics view scenes, but since 4.5 the impact of this is minimal.

Well, perhaps not minimal compared to drawing a 2 pixel long line, but minimal compared to filling a 64×64 rectangle. The point is that though the raster engine is the engine that probably handles state changes best of all our engines, there are some use cases where it still shows up, and it should still be minimized.
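One way to keep state changes down on the application side is to batch small primitives by brush before drawing them. The sketch below is plain C++ with hypothetical names (`Item`, `brushChanges`, `groupByBrush`) and does not touch QPainter at all; it only shows the bookkeeping. It also assumes the items do not overlap, since reordering overlapping items would change the result.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw command: a rectangle filled with a given brush color.
struct Item {
    std::uint32_t brush;
    int x, y, w, h;
};

// Number of brush set-ups a painter would see when drawing the items in order
// (the initial brush set-up for the first item is counted too).
int brushChanges(const std::vector<Item> &items) {
    int changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (i == 0 || items[i].brush != items[i - 1].brush)
            ++changes;
    return changes;
}

// Group items by brush. The sort is stable, so the relative order of items
// sharing a brush is preserved.
void groupByBrush(std::vector<Item> &items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const Item &a, const Item &b) { return a.brush < b.brush; });
}
```

With five small rectangles alternating between two brushes, drawing them in their original order costs five brush set-ups, while the grouped order costs two.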

Span functions

The task of the span functions is to generate a pixel and combine it with the destination according to the current state of the painter. Though the raster engine supports rendering to any of our image formats except 8-bit indexed, it will internally do all rendering in ARGB32_Premultiplied. Premultiplied alpha has the benefit that we don’t have to multiply the alpha into the color channels and it saves us a division in the blending. The reason for doing all rendering in one format is that the alternative simply doesn’t scale. Just think of the combination of composition modes multiplied with the number of image formats a source image can have multiplied with what formats the destination can have. To support all combinations we have a generic approach where we for each span do:

  • Get the source pixels, e.g. from a gradient, pixmap, image or solid color, and convert them to ARGB32_Premultiplied.
  • Get the destination pixels and convert them to ARGB32_Premultiplied.
  • Blend the source into the destination using the current composition mode.
  • Convert the result to destination format and write it back.
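The generic steps above can be sketched like this for one concrete case: an ARGB32 source blended with SourceOver onto an ARGB32_Premultiplied destination. This is a simplified illustration, not the code from the engine; `mul255`, `premultiply`, `sourceOver` and `blendSpan` are hypothetical helper names. Because the destination is already premultiplied, the two destination conversions become no-ops, and the blend itself needs no division by alpha.

```cpp
#include <cstdint>

// round(v * a / 255) for 8-bit v and a, without a division.
inline std::uint32_t mul255(std::uint32_t v, std::uint32_t a) {
    std::uint32_t t = v * a + 128;
    return (t + (t >> 8)) >> 8;
}

// Convert a 0xAARRGGBB pixel with straight alpha to premultiplied alpha.
inline std::uint32_t premultiply(std::uint32_t p) {
    std::uint32_t a = p >> 24;
    std::uint32_t r = mul255((p >> 16) & 0xff, a);
    std::uint32_t g = mul255((p >> 8) & 0xff, a);
    std::uint32_t b = mul255(p & 0xff, a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// SourceOver on premultiplied pixels: result = src + dst * (1 - srcAlpha).
// Assumes valid premultiplied input (each color channel <= alpha).
inline std::uint32_t sourceOver(std::uint32_t src, std::uint32_t dst) {
    std::uint32_t ia = 255 - (src >> 24);
    std::uint32_t a = (src >> 24) + mul255(dst >> 24, ia);
    std::uint32_t r = ((src >> 16) & 0xff) + mul255((dst >> 16) & 0xff, ia);
    std::uint32_t g = ((src >> 8) & 0xff) + mul255((dst >> 8) & 0xff, ia);
    std::uint32_t b = (src & 0xff) + mul255(dst & 0xff, ia);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// Generic path for one span: fetch source, convert, blend, store.
void blendSpan(std::uint32_t *dst, const std::uint32_t *srcArgb32, int len) {
    for (int i = 0; i < len; ++i)
        dst[i] = sourceOver(premultiply(srcArgb32[i]), dst[i]);
}
```

The per-pixel `premultiply` call is exactly the kind of conversion cost that disappears when the source is already stored as ARGB32_Premultiplied or RGB32.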

This may seem like a lot of work, so luckily the story doesn’t end there.

Special casing and Optimizations

As I outlined in the QPainter documentation patch that I added recently, which was the start of this blog series, it’s all about defining which scenarios we want to be fast and which scenarios we just need working. Over the years since the initial release of the raster engine in the summer of 2005, we’ve added tons of special cases to support what we experience as the functions that are called the most and which have the most impact.

  • First of all, if you look at the things we do for each span above, you see that we convert into ARGB32_Premultiplied. Solid colors are easy to represent, gradients are generated in this format directly, so conversion only happens for images and pixmaps. If the image is ARGB32_Premultiplied, then no conversion is needed, and we just use the scanline pointer directly, without any copying. Our RGB32 format is specified to be 0xffRRGGBB, with the alpha set to 0xff. This means it is pixel-wise compatible with ARGB32_Premultiplied, which again means that it can also be used directly. If the source is ARGB32, you’ll get a memcpy for each scanline where the ARGB32 data is copied into a temporary buffer and converted to ARGB32_Premultiplied. What can you read from that? Do not draw ARGB32 images into the raster engine. Secondly, don’t open a painter on an ARGB32 image, as that implies the exact same, but when reading and writing the destination pixels. Now you know why QPixmaps prefer to be in these formats too.
  • Source composition modes are special cased for most operations. For instance, we don’t read the destination for source operations because we know there is no blending involved, unless the spans have partial coverage that is. This means that Source is effectively just a memory write.
  • SourceOver is usually special cased to be inlined and merged with the coverage opacity, so it is also usually faster than the other composition modes. As for the other optimizations down below, these only hold for Source and SourceOver, so if you want best performance, make sure that this is what you are using. SourceOver is the default in QPainter, by the way.
  • For gradients and pixmaps, we need to create an array of source data. For solid colors, it’s just a single pixel, so this is faster. A solid color source also benefits from the fact that you only have to traverse memory for the destination, where you write to, so the cache misses are significantly reduced.
  • Rectangle fills are very common, both through QPainter::fillRect and through QPainter::drawRect. In 4.4 both of these implied a state change. Actually, fillRect implied two state changes because it set the brush to what was passed to fillRect and then set it back to what the painter state was. In 4.5, as part of this Falcon project, we introduced a new internal QPaintEngine subclass which supports a state-less fillRect with a color. This matches how applications normally use the painter anyway.
  • In addition to being stateless, the fillRect function is special cased for a number of use cases. For instance, for RGB16, we write two pixels at a time, and for Intel machines there is an SSE/MMX optimized version. The special cased fillRect also has the benefit that it doesn’t require spans; it’s just a tight 2D for loop, which also saves us quite a bit of work, at least if the spans are short.
  • Duff’s Device. I cannot take credit for its addition, but it’s used in a lot of different places in the raster engine today. It’s all about loop unrolling. If you’re not familiar with it yet, read up on it. It’s a beautiful abuse of the C++ language to make things potentially faster.
  • Rectangular clipping is also special cased, at least as long as there is no transformation set on the painter. Translate is of course special cased, but scaling and rotating disable this optimization. The benefit we get from doing rectangular clipping is that finding the spans to fill is done on the QRect level, rather than on the per-span level, which makes it significantly faster.
  • So if you have Source or SourceOver, a non-perspective, non-smooth transform and the clip is a rectangular clip, you also get the benefit of our pixmap blend functions. These were added in Qt 4.5 and are the reason why pixmap drawing is quite a bit faster now than in the earlier versions. In Qt 4.5, we had blend functions for scale and translate only, and in Qt 4.6 we added rotations to the list as well. Again, we focus on a selected subset of formats, matching what QPixmap will be using. We only have these for:
    • ARGB32_Premultiplied on ARGB32_Premultiplied
    • ARGB32_Premultiplied on RGB32
    • ARGB32_Premultiplied on RGB16
    • ARGB8565_Premultiplied on RGB16
    • RGB32 on RGB32
    • RGB16 on RGB16

    I think that was all of them.

  • The outlines are processed via the stroker in the general case. However, there are again a number of special cases where we drop to doing a midpoint-algorithm instead. Lines, polylines and paths that only contain line segments will be rendered using the fast midpoint approach as long as the pen width is equal to or less than 1. We also support dashing line segments for 1 pixel wide lines using this method. For any pen width greater than 1, curved paths or antialiasing, we drop to the stroker approach which works, but is far less optimal. Actually, I think there is a special-case for antialiased dashed lines too, as long as they are thin.
  • When antialiasing is enabled, we often need to fall back to the stroker for outlines, which is quite a bit slower than the plain case. In addition to that, there are a lot more spans generated for antialiased content, due to the fade-in, fade-out effect on the edge of the primitive, so expect antialiasing to be a significant cost.
  • Text drawing is since 4.5 highly optimized for most engines, to the point where the major bottleneck these days is in doing the actual text layout on the string. We’re working on an API to cache this, so text drawing can be made truly fast, but based on the current API, it’s as good as it gets. However, if the transformation is a rotate/scale, then we fall back to path drawing. Only the Windows version of the raster engine supports drawing glyphs at rotated angles using the fast paths, so beware of that.
A lot of details, but it gives an idea of what to consider when you write code for this engine. If all you are drawing is 1024×1024 pixmaps, then none of these things matter because all the time is spent in the span function that does pixmap blending anyway, but the second you have more content, several lines, several polygons, which are smaller in size, then these things are critical to achieve good performance.
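Two of the items in the list above, the span-free fillRect and Duff’s Device, fit together nicely in a small sketch. This is an illustration under stated assumptions and not the engine’s implementation: `fillRun` and `fillRect` are hypothetical names, the destination is 32 bits per pixel, `stride` is measured in pixels, and no clipping is done.

```cpp
#include <cstdint>

// Fill `count` pixels with `color`, unrolled 8 times using Duff's Device.
inline void fillRun(std::uint32_t *dst, int count, std::uint32_t color) {
    if (count <= 0)
        return;
    int n = (count + 7) / 8;   // number of passes through the unrolled loop
    switch (count & 7) {       // jump into the loop body to handle the remainder
    case 0: do { *dst++ = color;
    case 7:      *dst++ = color;
    case 6:      *dst++ = color;
    case 5:      *dst++ = color;
    case 4:      *dst++ = color;
    case 3:      *dst++ = color;
    case 2:      *dst++ = color;
    case 1:      *dst++ = color;
            } while (--n > 0);
    }
}

// Rectangle fill without spans: a tight loop over the rows of the rectangle.
void fillRect(std::uint32_t *pixels, int stride,
              int x, int y, int w, int h, std::uint32_t color) {
    for (int row = 0; row < h; ++row)
        fillRun(pixels + (y + row) * stride + x, w, color);
}
```

The case labels fall through on purpose; the switch only decides where in the unrolled body the first pass starts. Whether this beats what a modern compiler generates for a plain loop is something to measure rather than assume.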

The overall performance of the engine, when used according to how it’s outlined above, can be thought of as:

Overhead + O(pixelsTouched * memoryAndBusCapacity)

There is nothing scientific about that formula, but when you’re hitting the optimal path, all time should be spent in one of the many for loops inside qdrawhelper_xxx.cpp or even better qblendfunctions.cpp. These loops will spend all their time on per pixel processing. If these functions could be made faster by doing the algorithms slightly differently, then great, but if you see in your profiling that all time is spent in for instance qt_blend_argb32_on_argb32, then that means you told us to blend alpha pixmaps together and we’re doing that as fast as we can and you have zero loss between your app and actual processing. If all time is spent processing pixels, then that is a good thing. The overhead here is the time spent in state changes, function call overhead, and similar.

Some numbers

I got some feedback on one of the previous blogs that a few bar charts would be nice, so I’ll post some numbers on what kind of throughput is possible with the raster paint engine. I’ve timed it on both my Windows desktop machine and on my N900 to get a comparison. The operations range from several million per second to only a few hundred, so the scale is logarithmic; keep that in mind as you look at them.

[Image: Raster Results]

As you can see, the fill-rate is more or less tied to the number of pixels involved. For some operations it takes a little bit longer to do something, like drawPixmap with scaling is somewhat slower than drawPixmap without, but you see that the rough formula I gave above holds quite often. Double the size of the primitive in each direction and you have one quarter the performance. It was also not my intention to trick you with using different numbers for drawPixmap; it’s just how the test was set up.

If you compare the three 4×4 rectangle drawing versions, you see that they differ when the rectangles are small. drawRect without brush change is fastest at around 7.4Mops/sec, followed by fillRect at ~6.1Mops/sec and then drawRect with brush change at 1.8Mops/sec. At 128×128 there is just a little difference between them, which is what I was getting at with the state changes above. It is possible to do them and if you’re drawing semi-large areas, it doesn’t matter, but if you’re plotting pixels, doing loads of small lines here and there or particle effects with 8×8 pixmaps, then you want to do that in a tight loop with nothing else happening.

You can also see that the speed of non-smooth scaling is holding its own vs non-scaled pixmap drawing.

Finally, if you compare the N900 to the desktop Windows machine, you see that despite the Windows machine only having a 4 times faster processor, the N900 is often around 10 times slower. Why? Because the CPU isn’t the only limitation; bus/memory capacity is also a limiting factor, and, to be honest, it’s not a fair comparison…

I hope you enjoyed this post and more will come in 2010.



    19 comments

    1 Philippe December 18, 2009 at 5:44 pm
     

    Great post again :-)
    You say: “Only the windows version of the raster engine supports drawing glyphs at rotated angles using the fast paths, so beware of that.”
    Well, in Qt 4.4 (or 4.5?), under OSX, drawing text at 90° was as a matter of fact extremely slow (but nice). I have measured again today with Qt 4.6: speed is many times faster than before, almost “normal”. But text does not look very good anymore (worse than anti aliasing).
    I know you plan to have the raster engine as default for OSX in Qt 4.7. I hesitate using the raster engine today with OSX and Qt 4.6. Any advice on pros and cons?

    2 gunnar December 18, 2009 at 9:29 pm
     

    Philippe: With “not good anymore”, do you mean that the glyphs are only gray antialiased and when the font is small, it becomes somewhat blurry? This is because transformed text is hitting the “slow” path which converts it to a QPainterPath which is then filled.

    The text drawing in the raster engine was changed in 4.5 for all platforms to use this “glyph cache” method, meaning we extract the natively rendered glyph image only once. Special support is required in the font engines to provide glyphs rasterized with transformations and only the windows font engine has this capability currently. For raster to be default on Mac we should and probably will add similar functionality to the Cocoa font engine.

    3 Carina Denkmann December 19, 2009 at 12:11 am
     

    Very interesting post indeed, thanks Gunnar.

    I’ve been using the “-graphicssystem raster” *buildtime* option since it has been available, so all of my Qt runs in that mode. Yes, I know it isn’t recommended yet to use that as the default, but I have nothing but good experience both with performance and with compatibility. It is just Kolourpaint that likes to be started with “-graphicssystem native”, and it’s a trollsend that you can change that via a simple command line argument.

    What I would like to read/learn about is the path the pixels take after they are rendered to a raster memory buffer. With raster being a pure software method, you cannot render directly to the screen’s DRAM (or do you, in case of UMA?), so you need to copy them to the screen. How are pixels blitted to the screen, and how many copies/blits/conversions are required depending on the graphics driver/window manager/compositing mode? On my system, this seems to be the bottleneck.

    4 mathpup December 19, 2009 at 7:19 am
     

    As of 4.6.0, only the native graphics engine on Mac OS X 10.6 produces good quality text. Text from the raster engine is much too thin and light. (The opengl produces good quality text, but the widgets actually seem to be missing except for their text.)

    5 Philippe December 19, 2009 at 10:12 am
     

    Yes, “too thin and light” is how I should describe vertical text in mode “raster” under OSX. This can be seen with the Qt demo “Main window” where we can dock/undock widgets. Rebuild this example and add “QApplication::setGraphicsSystem(QLatin1String("raster"))” at the start.
    Run then select menu “Dock Widgets > Red > Vertical title bar”.
    The widget will first turn white (which is a bug in the raster mode only), but anyway drag and toggle this widget between floating and docked states. In the docked state, the font is too light, too thin. In floating mode, the caption bar draws the text correctly (I guess OSX does the painting and raster is not used) .
    This example is not too shocking, I admit, but I have seen worse cases with other font sizes.

    6 booklett December 19, 2009 at 1:58 pm
     

    It would be perfect if you prepare a small booklet in pdf where all the blogs are collected.
    Somehow this could extend your Quarterly series.

    7 Philippe December 19, 2009 at 3:20 pm
     

    @gunnar; blurry? No. “thin and light” is also as I would describe vertical text, in OSX with the raster engine.
    Example: take the “Main Window” Qt demo (the one where you can dock colored widgets). Build and run with the raster engine. Select the menu “Dock widgets > Red > Vertical title bar”
    First you will get a blank widget (bug in raster mode btw) but anyway drag the widget to make it float, then dock it. You will see a difference in docked state (raster) and floating mode (Mac painting I guess), for the vertical text in the caption bar. This example is not dramatic, but I have seen more differences with other fonts in other contexts.

    8 Brandon December 20, 2009 at 8:10 pm
     

    This is very interesting stuff. What level of AA do you use? 8×8?

    9 gunnar December 22, 2009 at 6:08 pm
     

    Philippe: I’ve seen this effect in the past, but I thought all those “thin” fonts were ironed out. Do you have a non-apple screen by any chance? I’ll have a look at reproducing this once I get back from vacation.

    Brandon: The antialiasing in the raster engine is not the conventional GL YxZ multisampling. It uses the FreeType gray rasterizer, does a per-scanline accumulation per primitive and produces 256 levels of antialiasing from that.

    10 Jude December 22, 2009 at 7:05 pm
     

    I like Qt, it’s good and well designed, but for my work I needed a library that is really fast, Qt disappointed me though.
    I was especially interested in improving the text rendering time. Using the library that I am using it takes 0.5 ms to render an antialiased text of reasonable size.

    I was shocked to see the same text took 40 ms in Qt, that is nearly 80 times slower :(

    11 gunnar December 22, 2009 at 8:12 pm
     

    Jude: that sounds like a lot. What is your usecase?

    12 Gregory December 22, 2009 at 9:53 pm
     

    I’m glad you mentioned AGG.

    Qt, relying on libraries and NIH syndrome… a long long story apparently :)

    13 Philippe December 23, 2009 at 1:50 pm
     

    @gunnar: Yes, I only have (good) non-apple screens. Do you mean Qt behaves differently if an Apple screen is detected?
    Again, I’m speaking of 90° rotated fonts only.

    14 mathpup December 24, 2009 at 3:54 am
     

    I wanted to clarify that my comment about text looking much too thin and light using -graphicssystem native applies to normal horizontal text, such as in the orderform demo:

    http://img34.imageshack.us/img34/8476/orderform.png

    15 gunnar December 24, 2009 at 1:29 pm
     

    mathpup/Philippe: The bug in the image looks very much like a bug we fixed a month or two back. It was visible only on non-Apple screens as Mac OS X can decide to render glyphs differently for some screens. I’ll check why this problem has resurfaced after new years.

    16 mathpup December 26, 2009 at 9:22 am
     

    I think I solved the problem for my particular situation. I apparently had AppleFontSmoothing (in the defaults database) set to 4, which affected Qt’s text rendering but not the Apple native text rendering.

    BTW: Apple changed the way that “font smoothing” was set up in the transition from Leopard to Snow Leopard. It’s now an off/on setting, where off sets AppleFontSmoothing to 0 in the defaults database, whereas enabling font smoothing actually deletes AppleFontSmoothing from the database. (Leopard permitted setting a value from 0 to 3. Snow Leopard actually understands this range of values, but does not permit anything but on/off to be selected in System Preferences.)

    One more thing: The letter spacing seems to be a little off in some situations, regardless of the graphics system. In my previous screenshot, the word “receive” is particularly bad. (Ignore the now-solved darkness of text problem.)

    17 mathpup December 26, 2009 at 9:51 am
     

    The text rendered by Qt is on the left. The text captured from TextEdit.app is on the right.

    http://img254.imageshack.us/img254/5472/qtlefttexteditright.png

    18 steveg December 29, 2009 at 11:10 pm
     

    Will you be adding support for drawing on 8bit indexed images?

    19 gunnar December 31, 2009 at 9:13 am
     

    steveg: We will not add support for rendering into 8bit indexed. It would be dithered and slow and the color table would be fixed ahead of time, causing the colors to potentially be all over the place. We have thought about supporting grayscale-only or alpha-only 8-bit, but it’s not on our current roadmap.
