Cell == NUON on steroids?

Skah T

Former VML engineer
Reaction score
62
While reading this article about the Cell architecture, I was reminded quite a bit of the Aries MPE architecture, in particular by this section describing the individual SPEs. I added translations from the Cell world to the NUON world in brackets.

Unlike the PPE [MPE3], the SPEs [MPEs 0-2] do not have caches. Instead, they each get a 256K [4-8K] "local store" that only they can see. All code and data for the SPE [MPE] must be stored within this 256K [4-8K] local area. In fact, the SPEs [MPEs] cannot "see" the rest of the chip's address space at all. They can't access each others' local stores nor can they access the PPE's [MPE3's] caches or other on-chip or off-chip resources. In effect, each SPE is blind and limited to just its own little corner of the Cell [Aries] world.

Why the crippled address map? Each SPE [MPE] is limited to just a single memory bank with deterministic access characteristics in order to guarantee its performance. Off-chip (or even on-chip) memory accesses take time--sometimes an unpredictable amount of time, and that goes against the SPE's [MPE's] purpose. They're designed to be ultra-fast and ultra-reliable units for processing streaming media, often in real-time situations where the data can't be retransmitted. By limiting their options and purpose, Cell's [Aries'] designers gave the SPEs [MPEs] deterministic performance.

You could be reading the NUON architecture manual. In fact, it just struck me that experience programming for NUON would translate well to Cell. Maybe I should check around for PS3 developers ;)

One area where NUON still has the edge is being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.
 
Riff

Guest
Skah T said:
One area where NUON still has the edge is being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.

You can't reasonably compare the two by issue rate alone. Each SPE is clocked at 3.2 GHz compared to the Aries 3 clock rate of 108 MHz. Assuming maximum issue throughput on both Aries 3 and Cell (seven per cycle versus two), the 3.2 GHz Cell SPE can issue roughly 8.5 times as many instructions per second as an Aries 3 MPE.
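The peak comparison can be worked through as a quick sketch. It uses only the clock rates and per-cycle issue widths quoted in this thread; the function and variable names are made up for illustration.

```python
# Peak issue rate = clock frequency x maximum instructions issued per cycle.
# Clock rates and issue widths are the figures quoted in the thread.

def peak_issue_rate(clock_hz: float, ops_per_cycle: int) -> float:
    """Best-case instructions per second for one core."""
    return clock_hz * ops_per_cycle

spe_peak = peak_issue_rate(3.2e9, 2)   # one Cell SPE: 3.2 GHz, 2 per cycle
mpe_peak = peak_issue_rate(108e6, 7)   # one Aries 3 MPE: 108 MHz, 7 per cycle
peak_ratio = spe_peak / mpe_peak

print(f"SPE peak: {spe_peak:.3e} instructions/s")
print(f"MPE peak: {mpe_peak:.3e} instructions/s")
print(f"SPE / MPE: {peak_ratio:.2f}")  # about 8.47
```

This is a best-case figure for a single core on each side. It says nothing about sustained throughput or memory stalls.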

Another factor to consider is that the biggest bottleneck of Nuon is that only MPE3 can be used to access external memory, and the pseudo-cache performance is dismal. A cache miss essentially slows MPE3 down to 1 or 2 MHz. Even when data is in the data cache, if the data is in the second set for a given address, you lose an extra cycle, effectively halving the speed of the processor.

It is almost impossible to achieve a throughput of seven instructions per cycle on an MPE even with the most hardcore handcrafted assembly code. The largest packet size I ever saw was in some MGL inner loops that had five instructions per packet. Achieving a *sustained* throughput of seven instructions per cycle is right out. On MPE3 things are even worse, and the compiler will rarely create packets with even two instructions. In the absolute best case I would say the sustained throughput of the MPEs is around 2.5 instructions per cycle. The throughput of the Cell SPEs is going to be very, very close to 2. The PPE will also be getting the same throughput, whereas on MPE3 the throughput will likely not even be 1 given the absolutely slow speed of memory accesses and huge cache miss penalties.

I can guarantee you that if the Nuon were scaled to GHz speeds on par with Cell, it would not function at all. The reason the MPEs are able to execute so many instructions in a single cycle is that the pipeline stages are so damn slow (probably due to the complex decode/issue stage) that completing complex operations in a single cycle is no problem.
 

Skah T

Former VML engineer
Reaction score
62
We're also comparing 5+ year old technology with something that's not even available yet. Not to mention Cell is likely partial-to-full custom whereas Aries is synthesized.

I forwarded the article to one of the Aries HW guys. He thinks in today's technology, with some repipelining, a synthesized MPE could achieve 800 MHz. And given the same transistor count as Cell you could fit 30-40 MPEs on a die. I think that would be very competitive.
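As a rough sanity check on "very competitive", here is a back-of-the-envelope sketch of aggregate issue rates. The 800 MHz clock and the 30-40 MPE count come from this post, and the sustained per-cycle figures (about 2.5 for an MPE, about 2 for an SPE) are the estimates given earlier in the thread. The count of eight SPEs per Cell is not stated in the thread and is an assumption here, as are all the names. The PPE and any memory bandwidth limits are ignored.

```python
# Aggregate sustained issue rate = cores x clock x sustained instructions/cycle.
# ASSUMPTION: 8 SPEs per Cell chip (not stated in the thread).

def aggregate_rate(cores: int, clock_hz: float, sustained_ipc: float) -> float:
    """Instructions per second summed over all cores, ignoring memory stalls."""
    return cores * clock_hz * sustained_ipc

cell_rate = aggregate_rate(cores=8, clock_hz=3.2e9, sustained_ipc=2.0)
mpe_low = aggregate_rate(cores=30, clock_hz=800e6, sustained_ipc=2.5)
mpe_high = aggregate_rate(cores=40, clock_hz=800e6, sustained_ipc=2.5)

print(f"Cell SPEs (assumed 8): {cell_rate:.2e} instructions/s")
print(f"30 MPEs @ 800 MHz:     {mpe_low:.2e} instructions/s")
print(f"40 MPEs @ 800 MHz:     {mpe_high:.2e} instructions/s")
print(f"Ratio range: {mpe_low / cell_rate:.2f} to {mpe_high / cell_rate:.2f}")
```

Under these assumptions the hypothetical many-MPE part lands somewhat ahead on raw issue rate, but only if every MPE can actually be kept fed with data.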

Your point about hardly ever using all seven instructions per cycle is certainly valid - in compiled code at least. Most of my sprite renderers average around three instructions per packet, with a couple of packets in each inner loop containing four or five instructions. I even have one that's six instructions.

I think the throughput for compiled code could be made close to two instructions per cycle given some serious work on the compiler. Like if VML hadn't gone bankrupt and had used these last five years to improve the tools :)

Cache performance can also be improved substantially. At one point the HW guys were considering just putting a MIPS core in place of one of the MPEs. In hindsight that might have been a smarter route to take.
 
Riff

Guest
I think the Cell based Mercury blade servers are available for purchase now, but I'm not sure. Expensive no doubt, but available.

I think that if the Nuon were re-pipelined and made at current process scale, all of the other issues would start rearing their ugly heads. Although 30 to 40 MPEs might be able to fit on a single chip, there simply wouldn't be enough bandwidth to keep them satisfied. The current bus is overwhelmed by a 720x480 32-bit framebuffer already. Imagine what it would be like if you have ten times as many MPEs fighting for allocation of the bus.
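To put a number on the framebuffer traffic, here is a small sketch. The 720x480 resolution and 32-bit depth are from the post. The number of full passes over the buffer per second (30 or 60) is an assumption chosen for illustration, and no figure for the actual bus bandwidth is assumed.

```python
# Bytes moved when the whole framebuffer is read or written once,
# and the resulting traffic for a given number of full passes per second.

WIDTH, HEIGHT = 720, 480      # from the post
BYTES_PER_PIXEL = 4           # 32-bit pixels, from the post

def frame_bytes() -> int:
    """Size of one full framebuffer in bytes."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL

def traffic_bytes_per_second(passes_per_second: int) -> int:
    """Bus traffic if the full buffer is touched this many times per second."""
    return frame_bytes() * passes_per_second

print(f"One frame: {frame_bytes():,} bytes")
for passes in (30, 60):       # ASSUMED pass rates, not from the thread
    mb = traffic_bytes_per_second(passes) / 1e6
    print(f"{passes} passes/s: {mb:.1f} MB/s")
```

Rendering usually touches the buffer more than once per displayed frame (clear, draw, scan-out), so real traffic would be some multiple of these figures.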

The second biggest issue is the limited usefulness of the MPE functional units. Of the seven instructions that can be activated per cycle, two are dedicated 16-bit counter decrements that are infrequently used compared to all other instructions and one functional unit is dedicated to control flow instructions which are also infrequently encountered compared to ALU operations. Meanwhile, there is ample supply of independent ALU instructions in programs but they cannot be paired because there is only one full ALU. Extending the MUL unit into a full blown ALU and ditching the RCU unit would speed up throughput immensely. At the very least it would help compensate for two cycle memory loads.
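The pairing problem can be illustrated with a toy packet packer. This is only a sketch: the slot table is a simplification based on the units named in this post (one ALU, one MUL, one control flow unit, two counter decrements, plus a memory slot), all ops are assumed independent, and the packer is a naive in-order greedy one, not how the real assembler or compiler schedules code. The unit labels and function names are hypothetical.

```python
from collections import Counter

# ASSUMED per-packet slot limits for a toy MPE model (not the real hardware spec).
SLOTS = {"ALU": 1, "MUL": 1, "MEM": 1, "ECU": 1, "RCU": 2}

def choose_unit(op, used, mul_is_alu):
    """Return a free unit for this op in the current packet, or None."""
    candidates = [op]
    if op == "ALU" and mul_is_alu:
        candidates.append("MUL")  # proposed change: MUL unit also runs ALU ops
    for unit in candidates:
        if used[unit] < SLOTS[unit]:
            return unit
    return None

def pack(ops, mul_is_alu=False):
    """Greedily pack independent ops, in order, into as few packets as it can."""
    packets, used = [], Counter()
    for op in ops:
        unit = choose_unit(op, used, mul_is_alu)
        if unit is None:          # no free slot: close this packet, start a new one
            packets.append(used)
            used = Counter()
            unit = choose_unit(op, used, mul_is_alu)
        used[unit] += 1
    if used:
        packets.append(used)
    return packets

ops = ["ALU", "ALU", "MEM", "ALU", "ALU", "ECU"]
print("one ALU:       ", len(pack(ops)), "packets")
print("MUL acts as ALU:", len(pack(ops, mul_is_alu=True)), "packets")
```

In this toy stream the single ALU forces four packets, while letting the MUL slot take ALU ops halves that to two.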

The design of the MPEs isn't horrible, but it is largely tied to the goals of the Nuon chip. I think to make a successful chip, the extra baggage needs to be thrown away and the MPEs redesigned from scratch. In particular, simplify the instruction set and get rid of VLIW. Eliminate the very complex instructions like SAT, LD_P, and ST_P. Increase bandwidth as much as possible. Increase local memory size as much as possible. Eliminate everything that "sounded good at the time" but was rarely or never used in practice.
 

Skah T

Former VML engineer
Reaction score
62
Certainly I didn't mean to imply you could just port Aries3 forward as-is. Large architectural changes are required to keep various things from becoming the bottleneck as the number of MPEs increases. Some of these changes were in the works for Aries 4 and 5.

I do wish the MUL unit could handle more ALU-ish instructions. Availability of logic ops (and, or, eor) is sometimes the bottleneck for me. It sucks having to serialize them through the ALU. I do like having the flow control instructions as a separate unit. They let me slap a branch into any packet where it's needed.

Of course not everyone (hardly anyone) wants to hand-tune assembly any more so I understand your point about ditching VLIW and other features that can only be exploited with manual assembly.
 