Cell == NUON on steroids?

Skah T

Former VML engineer
Reaction score
62
While reading this article about the Cell architecture, I was reminded quite a bit of the Aries MPE architecture, in particular by this section describing the individual SPEs. I added the translation from Cell to NUON terms in brackets.

Unlike the PPE [MPE3], the SPEs [MPEs 0-2] do not have caches. Instead, they each get a 256K [4-8K] "local store" that only they can see. All code and data for the SPE [MPE] must be stored within this 256K [4-8K] local area. In fact, the SPEs [MPEs] cannot "see" the rest of the chip's address space at all. They can't access each others' local stores nor can they access the PPE's [MPE3's] caches or other on-chip or off-chip resources. In effect, each SPE is blind and limited to just its own little corner of the Cell [Aries] world.

Why the crippled address map? Each SPE [MPE] is limited to just a single memory bank with deterministic access characteristics in order to guarantee its performance. Off-chip (or even on-chip) memory accesses take time--sometimes an unpredictable amount of time, and that goes against the SPE's [MPE's] purpose. They're designed to be ultra-fast and ultra-reliable units for processing streaming media, often in real-time situations where the data can't be retransmitted. By limiting their options and purpose, Cell's [Aries'] designers gave the SPEs [MPEs] deterministic performance.
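
To make the local-store model concrete, here is a minimal Python sketch of streaming a large buffer through a small, fixed-size scratch area one chunk at a time. Everything in it is illustrative: `LocalStore`, `process_stream`, and the chunk size are hypothetical names, and the explicit copy-in and copy-out steps stand in for whatever transfer mechanism the real hardware uses, which the quoted excerpt does not describe.

```python
# Toy model of a core that can only touch its own small local store.
# All names here are hypothetical; this is not a real Cell or NUON API.

class LocalStore:
    """Fixed-size scratch memory private to one processing element."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = bytearray()

    def load(self, chunk: bytes) -> None:
        # The core cannot see outside memory, so data must be copied in,
        # and it must fit entirely within the local store.
        if len(chunk) > self.capacity:
            raise ValueError("chunk does not fit in local store")
        self.data = bytearray(chunk)


def process_stream(source: bytes, store: LocalStore, chunk_size: int) -> bytes:
    """Process `source` chunk by chunk through the local store."""
    out = bytearray()
    for start in range(0, len(source), chunk_size):
        store.load(source[start:start + chunk_size])  # copy in
        # Example kernel: invert every byte, touching only local data.
        for i in range(len(store.data)):
            store.data[i] ^= 0xFF
        out += store.data  # copy the result back out
    return bytes(out)


if __name__ == "__main__":
    store = LocalStore(capacity=4 * 1024)   # 4K, the low end quoted for an MPE
    frame = bytes(range(256)) * 64          # 16 KiB of sample data
    result = process_stream(frame, store, chunk_size=4 * 1024)
    print(len(result), result[:4].hex())
```

The point of the sketch is that the kernel only ever indexes `store.data`, so its access time does not depend on where the source buffer lives.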

You could be reading the NUON architecture manual. In fact, it just struck me that experience programming for NUON would translate well to Cell. Maybe I should check around for PS3 developers ;)

One area where NUON still has the edge is in being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.
 

Riff

Guest
Skah T said:
One area where NUON still has the edge is in being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.

You can't reasonably compare the two by issue rate alone. Each SPE is clocked at 3.2 GHz compared to the Aries 3 clock rate of 108 MHz. Assuming maximum issue throughput on both (two per cycle on an SPE, seven on an MPE), a 3.2 GHz SPE can issue roughly 8.5 times as many instructions per second as an Aries 3 MPE.
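
As a quick back-of-the-envelope check, the peak rates can be worked out directly from the clock rates and issue widths quoted in this thread. With those figures the ratio comes out at roughly 8.5. This is a sketch of peak numbers only; `peak_issue_rate` is just an illustrative helper name.

```python
# Peak issue rate = clock frequency * maximum instructions issued per cycle.
# Figures are the ones quoted in this thread; these are peak numbers only.

def peak_issue_rate(clock_hz: float, ops_per_cycle: int) -> float:
    return clock_hz * ops_per_cycle

spe = peak_issue_rate(3.2e9, 2)   # one Cell SPE: 3.2 GHz, 2 ops/cycle
mpe = peak_issue_rate(108e6, 7)   # one Aries 3 MPE: 108 MHz, 7 ops/cycle

print(f"SPE peak: {spe / 1e9:.2f} G instr/s")
print(f"MPE peak: {mpe / 1e6:.0f} M instr/s")
print(f"ratio:    {spe / mpe:.2f}x")
```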

Another factor to consider is that the biggest bottleneck of Nuon is that only MPE3 can be used to access external memory, and the pseudo-cache performance is dismal. A cache miss essentially slows MPE3 down to 1 or 2 MHz. Even when data is in the data cache, if the data is in the second set for a given address, you lose an extra cycle, effectively halving the speed of the processor.

It is almost impossible to achieve a throughput of seven instructions per cycle on an MPE, even with the most hardcore handcrafted assembly code. The largest packet size I ever saw was in some MGL inner loops that had five instructions per packet. Achieving a *sustained* throughput of seven instructions per cycle is right out. On MPE3 things are even worse, and the compiler will rarely create packets with even two instructions. In the absolute best case I would say the sustained throughput of the MPEs is around 2.5 instructions per cycle. The throughput of the Cell SPEs is going to be very close to 2. The PPE will also be getting the same throughput, whereas on MPE3 the throughput will likely not even be 1, given the slow speed of memory accesses and the huge cache miss penalties.

I can guarantee you that if the Nuon were scaled to GHz speeds on par with Cell, it would not function at all. The reason the MPEs are able to execute so many instructions in a single cycle is that the pipeline stages are so damn slow (probably due to the complex decode/issue stage) that completing complex operations in a single cycle is no problem.
 

Skah T

We're also comparing 5+ year-old technology with something that's not even available yet. Not to mention Cell is likely partial-to-full custom, whereas Aries is synthesized.

I forwarded the article to one of the Aries HW guys. He thinks in today's technology, with some repipelining, a synthesized MPE could achieve 800 MHz. And given the same transistor count as Cell you could fit 30-40 MPEs on a die. I think that would be very competitive.

Your point about hardly ever using all seven instructions per cycle is certainly valid, in compiled code at least. Most of my sprite renderers average around three instructions per packet, with a couple of packets in each inner loop containing four or five instructions. I even have one that's six instructions.

I think the throughput for compiled code could be made close to two instructions per cycle given some serious work on the compiler. Like if VML hadn't gone bankrupt and had used these last five years to improve the tools :)

Cache performance can also be improved substantially. At one point the HW guys were considering just putting a MIPS core in place of one of the MPEs. In hindsight that might have been a smarter route to take.
 

Riff

Guest
I think the Cell based Mercury blade servers are available for purchase now, but I'm not sure. Expensive no doubt, but available.

I think that if the Nuon were re-pipelined and made at current process scale, all of the other issues would start rearing their ugly heads. Although 30 to 40 MPEs might be able to fit on a single chip, there simply wouldn't be enough bandwidth to keep them satisfied. The current bus is overwhelmed by a 720x480 32-bit framebuffer already. Imagine what it would be like if you had ten times as many MPEs fighting for allocation of the bus.
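
For a sense of scale, here is a rough sketch of what a single full pass over a 720x480, 32-bit framebuffer costs in bus traffic. The 60 Hz rate is an assumption for illustration, not a figure from this thread, and the sketch ignores everything else competing for the bus.

```python
# Rough bandwidth cost of one full pass over a 720x480, 32-bit framebuffer.
# The 60 Hz rate is an assumption for illustration, not a figure from the thread.

WIDTH, HEIGHT = 720, 480
BYTES_PER_PIXEL = 4   # 32-bit pixels
REFRESH_HZ = 60       # assumed

def framebuffer_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    return width * height * bytes_per_pixel

frame_bytes = framebuffer_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
per_second = frame_bytes * REFRESH_HZ

print(f"one frame:          {frame_bytes:,} bytes")
print(f"one pass at {REFRESH_HZ} Hz:  {per_second / 1e6:.1f} MB/s")
```

Every additional read or write pass per frame (clearing, compositing, display scan-out) adds roughly the same amount again, which is why more MPEs on the same bus would only make the contention worse.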

The second biggest issue is the limited usefulness of the MPE functional units. Of the seven instructions that can be issued per cycle, two are dedicated 16-bit counter decrements that are infrequently used compared to all other instructions, and one functional unit is dedicated to control flow instructions, which are also infrequently encountered compared to ALU operations. Meanwhile, there is an ample supply of independent ALU instructions in programs, but they cannot be paired because there is only one full ALU. Extending the MUL unit into a full-blown ALU and ditching the RCU unit would speed up throughput immensely. At the very least it would help compensate for two-cycle memory loads.

The design of the MPEs isn't horrible, but it is largely tied to the goals of the Nuon chip. I think to make a successful chip, the extra baggage needs to be thrown away and the MPEs redesigned from scratch. In particular, simplify the instruction set and get rid of VLIW. Eliminate the very complex instructions like SAT, LD_P, and ST_P. Increase bandwidth as much as possible. Increase local memory size as much as possible. Eliminate everything that "sounded good at the time" but was rarely or never used in practice.
 

Skah T

Certainly I didn't mean to imply you could just port Aries3 forward as-is. Large architectural changes are required to keep various things from becoming the bottleneck as the number of MPEs increases. Some of these changes were in the works for Aries 4 and 5.

I do wish the MUL unit could handle more ALU-ish instructions. Availability of logic ops (and, or, eor) is sometimes the bottleneck for me. It sucks having to serialize them through the ALU. I do like having the flow control instructions as a separate unit. They let me slap a branch into any packet where it's needed.

Of course, not everyone (hardly anyone) wants to hand-tune assembly anymore, so I understand your point about ditching VLIW and other features that can only be exploited with manual assembly.
 