Cell == NUON on steroids?

Skah T

Former VML engineer
While reading this article about the Cell architecture I was reminded quite a bit of the Aries MPE architecture, in particular by this section describing the individual SPEs. I added translations from Cell to NUON terms in brackets.

Unlike the PPE [MPE3], the SPEs [MPEs 0-2] do not have caches. Instead, they each get a 256K [4-8K] "local store" that only they can see. All code and data for the SPE [MPE] must be stored within this 256K [4-8K] local area. In fact, the SPEs [MPEs] cannot "see" the rest of the chip's address space at all. They can't access each others' local stores nor can they access the PPE's [MPE3's] caches or other on-chip or off-chip resources. In effect, each SPE is blind and limited to just its own little corner of the Cell [Aries] world.

Why the crippled address map? Each SPE [MPE] is limited to just a single memory bank with deterministic access characteristics in order to guarantee its performance. Off-chip (or even on-chip) memory accesses take time--sometimes an unpredictable amount of time, and that goes against the SPE's [MPE's] purpose. They're designed to be ultra-fast and ultra-reliable units for processing streaming media, often in real-time situations where the data can't be retransmitted. By limiting their options and purpose, Cell's [Aries'] designers gave the SPEs [MPEs] deterministic performance.
You could be reading the NUON architecture manual. In fact, it just struck me that experience programming for NUON would translate well to Cell. Maybe I should check around for PS3 developers ;)

One area where NUON still has the edge is in being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.
 
Riff

Guest
Skah T said:
One area where NUON still has the edge is in being able to execute up to seven ops per clock cycle rather than just two on the Cell's SPEs.
You can't reasonably compare the two by issue rate alone. Each SPE is clocked at 3.2 GHz compared to the Aries 3 clock rate of 108 MHz. Assuming maximum issue throughput on both Aries 3 and Cell, a 3.2 GHz SPE can issue about 8.5 times as many instructions per second as an Aries 3 MPE.
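For anyone who wants to check the arithmetic, here is a rough sketch in Python. It only uses the clock rates and issue widths quoted in this thread and treats peak issue rate as clock times issue width, which is a paper figure and not a measurement. Under those assumptions the ratio comes out near 8.5.

```python
# Peak issue rate = clock frequency x maximum instructions issued per cycle.
# All figures are the ones quoted in this thread; nothing here is measured.
SPE_CLOCK_HZ = 3.2e9     # Cell SPE clock
SPE_ISSUE_WIDTH = 2      # instructions per cycle on an SPE
MPE_CLOCK_HZ = 108e6     # Aries 3 MPE clock
MPE_ISSUE_WIDTH = 7      # maximum instructions per packet on an MPE


def peak_issue_rate(clock_hz, issue_width):
    """Theoretical upper bound on instructions issued per second."""
    return clock_hz * issue_width


spe_peak = peak_issue_rate(SPE_CLOCK_HZ, SPE_ISSUE_WIDTH)
mpe_peak = peak_issue_rate(MPE_CLOCK_HZ, MPE_ISSUE_WIDTH)
ratio = spe_peak / mpe_peak

print(f"SPE peak: {spe_peak / 1e9:.2f} G instr/s")
print(f"MPE peak: {mpe_peak / 1e9:.3f} G instr/s")
print(f"SPE / MPE: {ratio:.2f}x")
```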

Another factor to consider is that the biggest bottleneck of Nuon is that only MPE3 can be used to access external memory, and the pseudo-cache performance is dismal. A cache miss essentially slows MPE3 down to 1 or 2 MHz. Even when data is in the data cache, if the data is in the second set for a given address, you lose an extra cycle, effectively halving the speed of the processor.
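To put a rough number on that, here is a toy model in Python. It assumes every instruction takes one cycle unless it stalls, and it reads "1 or 2 MHz" as the rate MPE3 runs at while it is missing. The 5% stall fraction in the usage lines is purely hypothetical, picked only to show how quickly the average drops.

```python
CLOCK_HZ = 108e6  # Aries 3 clock rate quoted in this thread


def implied_stall_cycles(effective_hz, clock_hz=CLOCK_HZ):
    """Cycles spent per instruction if the core only manages effective_hz."""
    return clock_hz / effective_hz


def effective_rate_hz(stall_fraction, stall_cycles, clock_hz=CLOCK_HZ):
    """Average instruction rate when stall_fraction of instructions take
    stall_cycles cycles and the rest take a single cycle (toy model)."""
    cycles_per_instr = (1.0 - stall_fraction) + stall_fraction * stall_cycles
    return clock_hz / cycles_per_instr


# "1 or 2 MHz" corresponds to roughly 108 or 54 cycles per stalled instruction.
slow_case = implied_stall_cycles(1e6)
fast_case = implied_stall_cycles(2e6)
print(f"Implied stall: {fast_case:.0f} to {slow_case:.0f} cycles")

# Hypothetical: 5% of instructions stall for 54 cycles each.
example = effective_rate_hz(0.05, fast_case)
print(f"Effective rate at 5% stalls: {example / 1e6:.1f} MHz")
```

The same function covers the second-set penalty: with a stall fraction of 1.0 and a stall of 2 cycles, the effective rate is half the clock.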

It is almost impossible to achieve a throughput of seven instructions per cycle on an MPE even with the most hardcore handcrafted assembly code. The largest packet size I ever saw was in some MGL inner loops that had five instructions per packet. Achieving a *sustained* throughput of seven instructions per cycle is right out. On MPE3 things are even worse, and the compiler will rarely create packets with even two instructions. In the absolute best case I would say the sustained throughput of the MPEs is around 2.5 instructions per cycle. The throughput of the Cell SPEs is going to be very close to 2. The PPE will also be getting the same throughput, whereas on MPE3 the throughput will likely not even be 1 given the painfully slow memory accesses and huge cache miss penalties.

I can guarantee you that if the Nuon were scaled to GHz speeds on par with Cell, it would not function at all. The reason the MPEs are able to execute so many instructions in a single cycle is that the pipeline stages are so damn slow (probably due to the complex decode/issue stage) that completing complex operations in a single cycle is no problem.
 

Skah T

Former VML engineer
We're also comparing 5+ year old technology with something that's not even available yet. Not to mention Cell is likely partial-to-full custom whereas Aries is synthesized.

I forwarded the article to one of the Aries HW guys. He thinks in today's technology, with some repipelining, a synthesized MPE could achieve 800 MHz. And given the same transistor count as Cell you could fit 30-40 MPEs on a die. I think that would be very competitive.
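As a back-of-the-envelope check on "very competitive", here is a small Python sketch. It takes the 800 MHz and 30-40 MPE figures above, the 2.5 instructions per cycle sustained estimate from earlier in the thread, and assumes a Cell with 8 SPEs at 3.2 GHz issuing 2 per cycle. The SPE count comes from public Cell descriptions, not from this thread, and the whole thing ignores memory bandwidth entirely.

```python
# Aggregate instruction rate = cores x clock x instructions per cycle.
# Paper numbers only; bus and memory bandwidth are deliberately ignored.
CELL_SPES = 8            # assumption: not stated in this thread
SPE_CLOCK_HZ = 3.2e9
SPE_IPC = 2.0

MPE_CLOCK_HZ = 800e6     # the hardware engineer's estimate quoted above
MPE_SUSTAINED_IPC = 2.5  # best-case sustained estimate from this thread


def aggregate_rate(cores, clock_hz, ipc):
    """Instructions per second summed over all cores."""
    return cores * clock_hz * ipc


cell_rate = aggregate_rate(CELL_SPES, SPE_CLOCK_HZ, SPE_IPC)
mpe_rates = {n: aggregate_rate(n, MPE_CLOCK_HZ, MPE_SUSTAINED_IPC)
             for n in (30, 40)}

print(f"Cell, {CELL_SPES} SPEs: {cell_rate / 1e9:.1f} G instr/s")
for n, rate in mpe_rates.items():
    print(f"{n} MPEs: {rate / 1e9:.1f} G instr/s")
```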

Your point about hardly ever using all seven instructions per cycle is certainly valid, in compiled code at least. Most of my sprite renderers average around three instructions per packet, with a couple of packets in each inner loop containing four or five instructions. I even have one that's six instructions.

I think the throughput for compiled code could be made close to two instructions per cycle given some serious work on the compiler. Like if VML hadn't gone bankrupt and had used these last five years to improve the tools :)

Cache performance can also be improved substantially. At one point the HW guys were considering just putting a MIPS core in place of one of the MPEs. In hindsight that might have been a smarter route to take.
 
Riff

Guest
I think the Cell-based Mercury blade servers are available for purchase now, but I'm not sure. Expensive no doubt, but available.

I think that if the Nuon were re-pipelined and made at current process scale, all of the other issues would start rearing their ugly heads. Although 30 to 40 MPEs might be able to fit on a single chip, there simply wouldn't be enough bandwidth to keep them satisfied. The current bus is overwhelmed by a 720x480 32-bit framebuffer already. Imagine what it would be like if you have ten times as many MPEs fighting for allocation of the bus.
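To give a feel for the numbers, here is a rough Python sketch of the framebuffer traffic. The 720x480 32-bit size is from the paragraph above. The 60 Hz scan-out rate is my assumption for illustration, and the bus share function just divides a fixed bus evenly, which is a simplification and not how the real arbiter works.

```python
WIDTH, HEIGHT = 720, 480
BYTES_PER_PIXEL = 4      # 32-bit pixels
REFRESH_HZ = 60          # assumption: scan-out rate, not stated in this thread


def framebuffer_bytes(width, height, bytes_per_pixel):
    """Size of one full frame in bytes."""
    return width * height * bytes_per_pixel


def scanout_bytes_per_second(frame_bytes, refresh_hz):
    """Traffic needed just to read the frame out for display."""
    return frame_bytes * refresh_hz


def even_bus_share(mpe_count):
    """Fraction of a fixed bus each MPE gets if it is split evenly."""
    return 1.0 / mpe_count


frame = framebuffer_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
scanout = scanout_bytes_per_second(frame, REFRESH_HZ)

print(f"One frame: {frame} bytes")
print(f"Scan-out alone: {scanout / 1e6:.1f} MB/s")
print(f"Share with 4 MPEs: {even_bus_share(4):.3f}")
print(f"Share with 40 MPEs: {even_bus_share(40):.3f}")
```

That scan-out figure does not include the rendering writes into the buffer, so the real load is higher.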

The second biggest issue is the limited usefulness of the MPE functional units. Of the seven instructions that can be activated per cycle, two are dedicated 16-bit counter decrements that are infrequently used compared to all other instructions and one functional unit is dedicated to control flow instructions which are also infrequently encountered compared to ALU operations. Meanwhile, there is ample supply of independent ALU instructions in programs but they cannot be paired because there is only one full ALU. Extending the MUL unit into a full blown ALU and ditching the RCU unit would speed up throughput immensely. At the very least it would help compensate for two cycle memory loads.

The design of the MPEs isn't horrible, but it is largely tied to the goals of the Nuon chip. I think to make a successful chip, the extra baggage needs to be thrown away and the MPEs redesigned from scratch. In particular, simplify the instruction set and get rid of VLIW. Eliminate the very complex instructions like SAT, LD_P, and ST_P. Increase bandwidth as much as possible. Increase local memory size as much as possible. Eliminate everything that "sounded good at the time" but was rarely or never used in practice.
 

Skah T

Former VML engineer
Certainly I didn't mean to imply you could just port Aries3 forward as-is. Large architectural changes are required to keep various things from becoming the bottleneck as the number of MPEs increases. Some of these changes were in the works for Aries 4 and 5.

I do wish the MUL unit could handle more ALU-ish instructions. Availability of logic ops (and, or, eor) is sometimes the bottleneck for me. It sucks having to serialize them through the ALU. I do like having the flow control instructions as a separate unit. They let me slap a branch into any packet where it's needed.

Of course not everyone (hardly anyone) wants to hand-tune assembly any more so I understand your point about ditching VLIW and other features that can only be exploited with manual assembly.
 