AMD’s Bulldozer CMT Scaling

Posted by Aten-Ra on Feb 01, 2012, under Reviews/Articles

Introduction

Lately it has become increasingly obvious that single-thread performance no longer scales as easily as it did a few years ago. It takes an enormous amount of resources and time to raise IPC (Instructions Per Cycle) by just a few percent. Raising performance beyond that would require a substantially larger core, and the resulting increase in power consumption would be out of proportion to the performance gained.

Today’s IT needs are focused more on power reduction and performance per watt than on the absolute brute force that was the norm in the past. From servers to workstations and, for a few years now, down to desktops, more parallelization is becoming a necessity. Microprocessors have moved from single-core to multi-core designs as engineers look for ways to raise performance with fewer diminishing returns.

CMP, SMT and CMT
The easy way to get more thread parallelism is to put more cores on the same die, which is called CMP (Chip Multi-Processor). An example of CMP is AMD’s hexa-core Phenom II X6 processors, with 6 cores on a single die sharing the same L3 cache. The downside is a bigger die and higher power usage, depending on the number of cores built into the die.

Although Intel has quad- and hexa-core CMP microprocessors, it has also implemented SMT (Simultaneous Multi-Threading), known by the name Hyper-Threading, in its core design. With SMT, each core can process (fetch, decode, execute and retire) two threads simultaneously, with both threads sharing all the resources of the single core. This gives higher parallelization while keeping die size and power levels down. Because of this sharing, the second thread can only use the core resources that the first thread leaves idle, so performance scales less than with CMP, but with a much smaller die and much lower power usage.

AMD has recently introduced a new method called CMT (Clustered Multi-Threading) in its new Bulldozer microprocessors. In CMT, each module can process two threads simultaneously using a mix of shared and dedicated resources. The fundamental difference between SMT and CMT is that the latter has more dedicated resources to support two simultaneous threads. Because of that, multithreaded performance scales higher with CMT than with SMT, but lower than with CMP. AMD claims that its CMT architecture can deliver 80% of CMP performance with a smaller die and less power usage.


Testing procedure
To measure thread scaling performance, we used the following three microprocessors from AMD and Intel:
AMD Phenom II X6 1100T, AMD FX 8150 and Intel Core i7 2600K.


Processor frequencies were held constant at the base clock by disabling Turbo on all three processors.

The Phenom II X6 was used as the reference CMP processor. We measured it with one, two, four and six cores active. Base frequency: 3.3GHz.


The Intel Core i7 2600K was used as the SMT processor, since it has Hyper-Threading. We measured it with a single core, a single core with HT (two threads, SMT), two cores (CMP), two cores with HT (four threads, SMT), four cores (CMP) and four cores with HT (eight threads, SMT). Base frequency: 3.4GHz.


The AMD FX 8150 was used as the CMT processor. We measured it with a single core, a single module, two cores, two modules, four cores and four modules, as listed below. Base frequency: 3.6GHz.

Mod x1 Core x1 = Single thread (only a single core in a single Module)
Mod x1 Threads x2 = Dual Threads CMT (both cores in a single Module)
Mod x2 Threads x2 = Dual Threads CMP (two Modules, only one core per Module used)
Mod x2 Threads x4 = Quad Threads CMT (two Modules with two cores each)
Mod x4 Threads x4 = Quad Threads CMP (four Modules, only one core per Module used)
Mod x4 Threads x8 = 8 Threads CMT (four Modules with two cores each)
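The six FX 8150 configurations listed above can also be written out as sets of logical core indices. The following is only a rough sketch: it assumes that the two cores of module m are numbered 2m and 2m+1, which this article does not state, and the function and variable names are made up for illustration.

```python
def fx_core_set(modules, threads):
    """Return the logical core indices used by a 'Mod xM Threads xT' setup.

    Assumption (not from the article): module m owns cores 2*m and 2*m + 1.
    """
    if threads == modules:          # one core per module (CMP-style)
        per_module = 1
    elif threads == 2 * modules:    # both cores of each module (CMT)
        per_module = 2
    else:
        raise ValueError("threads must equal modules or 2 * modules")
    return [2 * m + c for m in range(modules) for c in range(per_module)]


def mode_name(modules, threads):
    """Label a configuration the same way the list above does."""
    if threads == 1:
        return "single thread"
    return "CMT" if threads == 2 * modules else "CMP"


# The six configurations from the list, as (modules, threads) pairs.
CONFIGS = [(1, 1), (1, 2), (2, 2), (2, 4), (4, 4), (4, 8)]

for modules, threads in CONFIGS:
    cores = fx_core_set(modules, threads)
    print(f"Mod x{modules} Threads x{threads} ({mode_name(modules, threads)}): cores {cores}")
```

Under that numbering assumption, the two dual-thread cases differ only in which cores are active: both cores of one module for CMT, or the first core of two different modules for CMP.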


The rest of the hardware
Motherboard for the AMD Processors: ASUS Crosshair V Formula
Motherboard for the Intel Processor: GIGABYTE Z68XP-UD3P
Memory: 2x 4GB Kingston DDR3 1600MHz 9-9-9-12
VGA: ASUS HD6950 1GB at 889MHz core and 1300MHz for the memory.
HDD: Seagate 500GB SATA II 7200rpm
PSU: ThermalTake SP-730P 730W 80+
OS: Windows 7 Ultimate 64-bit


Software Used
POV-Ray 3.7 RC (Balcony Project at 1024×768, AA 0.3)
Cinebench 11.5 (Multithread)
7-zip (32MB)
x264 HD v4.0
TrueCrypt (500MB AES)

POV-Ray 3.7 RC

We start with POV-Ray. The AMD FX’s single-thread performance is 21% lower than the last-generation Phenom II and 74% lower than Intel’s Core i7.
Scaling from a single core to two threads is 77.14% with CMT, while SMT scaling is 31.9%. With 8 threads, the FX 8150 scales higher than the Core i7 with 8-thread SMT.
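The scaling percentages quoted here read as the gain of the multi-thread result over the single-thread result. A minimal sketch of that arithmetic follows, assuming a "higher is better" score (a time-based result would need to be inverted first). The scores are placeholders chosen to reproduce the 77.14% figure, not measured data, and the function names are illustrative.

```python
def scaling_percent(single_score, multi_score):
    """Gain of the multi-thread score over the single-thread score, in percent."""
    if single_score <= 0:
        raise ValueError("single_score must be positive")
    return (multi_score / single_score - 1.0) * 100.0


def fraction_of_ideal(scaling_pct, threads):
    """Share of perfect linear scaling reached with `threads` threads."""
    return (1.0 + scaling_pct / 100.0) / threads


# Placeholder scores (not measured data); higher is better.
single = 100.0
dual_cmt = 177.14

gain = scaling_percent(single, dual_cmt)
print(f"CMT gain over single thread: {gain:.2f}%")
print(f"Share of an ideal 2x: {fraction_of_ideal(gain, 2):.1%}")
```

Read this way, a 77.14% gain means two threads on one module reach roughly 89% of a perfect 2x, while a 31.9% gain reaches roughly 66%. Note that this compares against ideal linear scaling, not against a measured CMP result.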


Cinebench 11.5 (Multithread)

The single-core performance of the FX 8150 is very low in this benchmark, but CMT scaling is very high. For the Intel processor, SMT scales by 24.63% at two threads and CMP scaling at four threads is 399.25. The FX shows higher scaling than the Core i7 in both CMP and CMT modes in this test.



7-zip

The 7-zip score is the combination of the Compress and Decompress scores for a 32MB file. The FX 8150’s single-thread performance is on par with the Phenom II but still trails the Core i7 2600K by 32%. Again the FX 8150 has the higher scaling in both CMP and CMT modes, and it manages to catch the performance of the Core i7 2600K running 8-thread SMT.



x264 HD v4.0 (Second Pass)

This is the first benchmark in which the FX 8150’s single-thread performance is higher than the Phenom II’s, although the Intel Core i7 is still 26.5% faster. Because the FX has strong single-thread performance here and higher thread scaling, it catches the Core i7 in dual-thread SMT mode and stays in front of it even in quad- and 8-thread SMT modes.



TrueCrypt (AES)

Because the Phenom II doesn’t support AES instructions, we tested only the FX and Core i7 processors in this benchmark. Again the FX’s single-thread performance is 35% lower than the Core i7’s, but thanks to its higher scaling it catches the Intel processor in the end.

Conclusion
AMD’s CMT design scales much higher than Intel’s SMT, and it is close to the 80% claim AMD has made. It seems that if the FX’s single-thread deficit against the Core i7 is about 35% or smaller, both CPUs perform about the same at higher thread counts. But if the single-thread deficit is larger than 35%, the higher CMT scaling cannot help the AMD FX processor catch up with Intel’s stronger single-thread performance.
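The break-even argument can be sketched as simple arithmetic. The model below is an illustration, not the method used for these tests: it assumes both chips run the same number of cores or modules with identical CMP scaling, so that only the single-thread deficit and the second-thread gain differ. The gains plugged in are the POV-Ray figures quoted earlier, and all names are hypothetical.

```python
def relative_throughput(single_deficit, cmt_gain, smt_gain):
    """FX multi-thread throughput relative to the Core i7 (1.0 = parity).

    single_deficit: how far the FX trails in single thread, e.g. 0.35 for 35%.
    cmt_gain / smt_gain: second-thread gains, e.g. 0.7714 and 0.319.
    Assumption: CMP scaling across modules/cores is identical on both chips.
    """
    return (1.0 - single_deficit) * (1.0 + cmt_gain) / (1.0 + smt_gain)


def break_even_deficit(cmt_gain, smt_gain):
    """Largest single-thread deficit that the CMT gain alone can offset."""
    return 1.0 - (1.0 + smt_gain) / (1.0 + cmt_gain)


# Second-thread gains quoted for POV-Ray in this article.
CMT_GAIN, SMT_GAIN = 0.7714, 0.319

limit = break_even_deficit(CMT_GAIN, SMT_GAIN)
print(f"Break-even single-thread deficit: {limit:.1%}")
for deficit in (0.20, 0.35, 0.50):
    ratio = relative_throughput(deficit, CMT_GAIN, SMT_GAIN)
    print(f"deficit {deficit:.0%}: FX at {ratio:.2f}x of the Core i7")
```

Because this simplified model leaves out clock speed differences and the higher CMP scaling the FX showed in several tests, its break-even point should not be expected to match the roughly 35% threshold observed in the benchmarks.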

It will be very interesting to see what happens if AMD can increase single-thread performance while keeping the higher scaling. CMT is an interesting and new technology for the desktop platform, and Windows, as well as application software, is not yet optimized for it. When software can take advantage of the FX’s SIMD instructions and CMT scaling, as shown in AES TrueCrypt, performance is on par with the Intel Core i7.
The AMD CMT architecture in Bulldozer has a lot of potential, and performance will only go up in the future with Piledriver and more optimized software.

