#1
Cell == NUON on steroids?
While reading this article about the Cell architecture I'm reminded quite a bit of the Aries MPE architecture. In particular, there is this section describing the individual SPEs. I added a translation from Cell to NUON terms in color.
Quote:
One area where NUON still has the edge is in being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.
#2
Quote:
Another factor to consider is that the biggest bottleneck of Nuon is that only MPE3 can be used to access external memory, and the pseudo-cache performance is dismal. A cache miss essentially slows MPE3 down to 1 or 2 MHz. Even when data is in the data cache, if the data is in the second set for a given address, you lose an extra cycle, effectively halving the speed of the processor.
It is almost impossible to achieve a throughput of seven instructions per cycle on an MPE even with the most hardcore handcrafted assembly code. The largest packet size I ever saw was in some MGL inner loops that had five instructions per packet. Achieving a *sustained* throughput of seven instructions per cycle is right out. On MPE3 things are even worse and the compiler will rarely create packets with even two instructions. In the absolute best case I would say the sustained throughput of the MPEs is around 2.5. The throughput of the Cell SPEs is going to be very, very close to 2. The PPE will also be getting the same throughput, whereas on MPE3 the throughput will likely not even be 1 given the absolutely slow speed of memory accesses and huge cache miss penalties.
I can guarantee you that if the Nuon were scaled to GHz speeds on par with Cell, it would not function at all. The reason that the MPEs are able to execute so many instructions in a single cycle is that the pipeline stages are so damn slow (probably due to the complex decode/issue stage) that completing complex operations in a single cycle is no problem.
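To put rough numbers on the stall argument in this post, here is a minimal sketch of how extra cycles eat into throughput. It is a back-of-the-envelope model, not a description of the real MPE pipeline: the function names are made up for illustration, and the 54 MHz clock and 40-cycle miss penalty are assumed values chosen only to show how a core can end up looking like it runs at 1 or 2 MHz.

```python
def effective_ipc(ops_per_packet, stall_cycles_per_packet):
    """Sustained instructions per cycle when each packet takes
    one issue cycle plus some number of stall cycles."""
    return ops_per_packet / (1.0 + stall_cycles_per_packet)


def effective_clock_mhz(clock_mhz, miss_rate, miss_penalty_cycles):
    """Apparent clock rate when a fraction of cycles (miss_rate)
    each incur miss_penalty_cycles of extra waiting."""
    return clock_mhz / (1.0 + miss_rate * miss_penalty_cycles)


if __name__ == "__main__":
    # One extra cycle per packet halves the rate (the "second set" case).
    print(effective_ipc(1, 0))  # 1.0
    print(effective_ipc(1, 1))  # 0.5

    # ASSUMED numbers, not from the thread: 54 MHz clock,
    # every access missing, 40-cycle penalty per miss.
    print(round(effective_clock_mhz(54.0, 1.0, 40), 2))  # 1.32
```

The same formula shows why wide packets do not help much once stalls dominate: a five-instruction packet that stalls four cycles sustains only one instruction per cycle.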
#3
We're also comparing 5+ year old technology with something that's not even available yet. Not to mention Cell is likely partial-to-full custom whereas Aries is synthesized.
I forwarded the article to one of the Aries HW guys. He thinks in today's technology, with some repipelining, a synthesized MPE could achieve 800 MHz. And given the same transistor count as Cell you could fit 30-40 MPEs on a die. I think that would be very competitive. Your point about hardly ever using all seven instructions per cycle is certainly valid, in compiled code at least. Most of my sprite renderers average around three instructions per packet, with a couple of packets in each inner loop containing four or five instructions. I even have one that's six instructions. I think the throughput for compiled code could be made close to two instructions per cycle given some serious work on the compiler. Like if VML hadn't gone bankrupt and had used these last five years to improve the tools. Cache performance can also be improved substantially. At one point the HW guys were considering just putting a MIPS core in place of one of the MPEs. In hindsight that might have been a smarter route to take.
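For a feel of what "very competitive" would mean, here is a rough sketch of peak instruction issue rates. The 30-40 MPE count, the 800 MHz figure and the two-to-three instructions per cycle come from this thread. The Cell side uses eight SPEs at two instructions per cycle, and the 3.2 GHz clock is an assumption picked for illustration. It counts issued instructions only and ignores SIMD width, memory stalls and everything else that matters in practice.

```python
def aggregate_gips(cores, clock_mhz, instr_per_cycle):
    """Peak billions of instructions issued per second across all cores."""
    return cores * clock_mhz * instr_per_cycle / 1000.0


if __name__ == "__main__":
    # Hypothetical repipelined Aries: figures quoted in the thread.
    print("30 MPEs @ 800 MHz, 2 instr/cycle:", aggregate_gips(30, 800, 2))
    print("40 MPEs @ 800 MHz, 3 instr/cycle:", aggregate_gips(40, 800, 3))

    # Cell SPEs: 8 cores, dual issue; the 3.2 GHz clock is an ASSUMPTION.
    print("8 SPEs @ 3200 MHz, 2 instr/cycle:", aggregate_gips(8, 3200, 2))
```

On these assumed numbers the two designs land in the same ballpark, which only holds if the memory system can keep every core fed.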
#4
I think the Cell-based Mercury blade servers are available for purchase now, but I'm not sure. Expensive no doubt, but available.
I think that if the Nuon were re-pipelined and made at current process scale, all of the other issues would start rearing their ugly heads. Although 30 to 40 MPEs might be able to fit on a single chip, there simply wouldn't be enough bandwidth to keep them satisfied. The current bus is overwhelmed by a 720x480 32-bit framebuffer already. Imagine what it would be like if you had ten times as many MPEs fighting for allocation of the bus.
The second biggest issue is the limited usefulness of the MPE functional units. Of the seven instructions that can be activated per cycle, two are dedicated 16-bit counter decrements that are infrequently used compared to all other instructions, and one functional unit is dedicated to control flow instructions, which are also infrequently encountered compared to ALU operations. Meanwhile, there is an ample supply of independent ALU instructions in programs, but they cannot be paired because there is only one full ALU. Extending the MUL unit into a full-blown ALU and ditching the RCU unit would speed up throughput immensely. At the very least it would help compensate for two-cycle memory loads.
The design of the MPEs isn't horrible, but it is largely tied to the goals of the Nuon chip. I think to make a successful chip, the extra baggage needs to be thrown away and the MPEs redesigned from scratch. In particular, simplify the instruction set and get rid of VLIW. Eliminate the very complex instructions like SAT, LD_P, and ST_P. Increase bandwidth as much as possible. Increase local memory size as much as possible. Eliminate everything that "sounded good at the time" but was rarely or never used in practice.
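The framebuffer point can be worked out with simple arithmetic. The sketch below computes the size of a 720x480 32-bit framebuffer and the traffic needed just to read it out once per frame. The frame rates are assumptions (the post does not give one), and the bus-share function is a deliberately naive even split used only to show what "ten times as many MPEs" does to each core's share.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Size of one frame in bytes."""
    return width * height * bits_per_pixel // 8


def scanout_mb_per_s(width, height, bits_per_pixel, frames_per_s):
    """Bandwidth (in 10^6 bytes/s) to read every pixel once per frame."""
    return framebuffer_bytes(width, height, bits_per_pixel) * frames_per_s / 1e6


def bus_share(num_cores):
    """Fraction of the bus each core gets under a naive even split."""
    return 1.0 / num_cores


if __name__ == "__main__":
    print(framebuffer_bytes(720, 480, 32))       # 1382400 bytes per frame
    # ASSUMED frame rates; one read pass per frame, no drawing traffic.
    print(scanout_mb_per_s(720, 480, 32, 30))
    print(scanout_mb_per_s(720, 480, 32, 60))
    # Four MPEs today versus a hypothetical forty.
    print(bus_share(4), bus_share(40))
```

Rendering adds writes, and often reads for blending, on top of the scanout figure, so the real load on the bus would be some multiple of it.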
#5
Certainly I didn't mean to imply you could just port Aries3 forward as-is. Large architectural changes are required to keep various things from becoming the bottleneck as the number of MPEs increases. Some of these changes were in the works for Aries 4 and 5.
I do wish the MUL unit could handle more ALU-ish instructions. Availability of logic ops (and, or, eor) is sometimes the bottleneck for me. It sucks having to serialize them through the ALU. I do like having the flow control instructions as a separate unit. They let me slap a branch into any packet where it's needed. Of course not everyone (hardly anyone) wants to hand-tune assembly any more, so I understand your point about ditching VLIW and other features that can only be exploited with manual assembly.
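The serialization complaint can be illustrated with a toy packet count. This sketch assumes all operations are independent and that a packet can hold at most one ALU op, one MUL op and one branch. It ignores the memory unit, the RCU, register ports and real encoding limits, so it is a simplified model rather than the actual MPE packing rules. The function names and the example mix of operations are made up.

```python
import math


def packets_baseline(arith_ops, logic_ops, mul_ops, branch_ops):
    """Minimum packets when logic ops can only issue on the single ALU."""
    return max(arith_ops + logic_ops, mul_ops, branch_ops)


def packets_mul_does_logic(arith_ops, logic_ops, mul_ops, branch_ops):
    """Minimum packets if the MUL unit could also execute logic ops,
    so logic ops can be split between the ALU and the MUL unit."""
    shared = math.ceil((arith_ops + logic_ops + mul_ops) / 2)
    return max(arith_ops, mul_ops, shared, branch_ops)


if __name__ == "__main__":
    # Hypothetical inner loop: 2 arithmetic, 4 logic, 1 multiply, 1 branch.
    print(packets_baseline(2, 4, 1, 1))        # 6
    print(packets_mul_does_logic(2, 4, 1, 1))  # 4
```

In this made-up loop the logic ops alone set the packet count, and letting a second unit take some of them cuts it by a third.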