This is two-factor authentication: to get in I need my card, and then I need a PIN. This is a scrambler pad, so every time you look at it the numbers are in a different order. This is the High Performance Computing facility for the University of Nottingham.

SEAN>> What do you use it for?

All sorts of things. It's basically to do with high-compute research, so for example students and researchers will use it for calculations in areas like fluid dynamics, aerospace, genomics, astronomy: anything that requires a large amount of compute.

SEAN>> And you've got earplugs in today, for obvious reasons.

Yes, it's a little bit noisy in here.

SEAN>> So we will do some talking outside (LINK IN DESCRIPTION), but can you show us a bit of it before we go outside?

Certainly, yes. The main HPC facility, which we call Minerva, is... and then we've got some extensions in the racks here.

SEAN>> All of these blinking lights, what's going on? Is this data activity or processing?

Both! The brighter lights you can see there are the storage, the disk storage; the compute nodes don't actually blink very much. The ones at the bottom there are the network activity. We do shut it down for maintenance once a year for a day or so. This is currently the third generation of HPC; the first one was installed about eleven years ago, and we regularly refresh it.

SEAN>> So how long has this one been going?

This one's been going for about four years.

SEAN>> Okay, and I hear rumors of a new one on the horizon?

Yes, we're in the procurement process at the moment to put a replacement in.

SEAN>> And will that mean this gets ripped out and a whole new one just gets put in?
Good question. We would like to use as much of it as possible, because although it is old there is still life left in it, and we do try to "sweat the assets", as they say. But certainly some of this will be replaced.

SEAN>> What's it running? Would we recognise any of the operating system?

Yes. Most of the nodes are running a version of Linux, and the storage is fairly standard, but above that we use PBS as our main scheduler.

SEAN>> How many people might be using this at one time?

At any one time they're probably running hundreds of jobs.

SEAN>> Do they run for a long time? Might they be running for years? How does it work?

We wouldn't have jobs that run for years, but we could certainly have jobs that run for months. Most of the jobs probably only run for days.

SEAN>> Okay, and when you look at a system like this, can you put a figure on how much it costs?

The capital cost for a system like this is probably about one and a half to two million pounds ($2.1m - $2.8m). Then there are the ongoing costs. We have about 250 kilowatts of air conditioning. When we run this particular block here flat out, it pulls about 70 kilowatts of power, and you're drawing that all the time, so to run the whole facility you're talking about thousands of pounds purely in power costs. Then of course there's all the ongoing licensing and the support for that, so it's not insignificant.

SEAN>> That's a lot of power. Is there a big red switch somewhere that someone has to pull to turn it on?

Yes there is, and no, I'm not going to press it for you.

SEAN>> It's obviously a lot of equipment and looks like it might be quite complicated. Does it ever go horribly wrong? Does it ever have big problems?

Generally speaking it is pretty reliable. Individual nodes will fail, individual disks will fail, but modern computer equipment is inherently reliable. We probably have more problems with the air conditioning than we do with the compute itself.

SEAN>> The other thing I was thinking about when you look at this: is it totally bespoke, or is it like a template? How would you go and buy a high-performance computer?

That's the $64,000 question. Basically you have to start by asking "What do we need it for?", because there is no one generic high-performance computing job. Different departments and different research have different computing requirements. Some are very high-performance computing, a lot of number crunching; for others it's about manipulating data, so there's a lot of data movement; for others it's about visualization. So the first thing you've got to do is ask "What is our mix of jobs?", because the hardware you set up for analytics is different from what you set up for visualization. Once you've got that, you go to a supplier and say: "This is what we want to do, and this is how much money we've got to spend. What can you give us?"

Although this is fairly old now, there is still quite a lot of life left in it. Okay, it's not cutting edge, but it'll still do a lot of the jobs, because a lot of the jobs are purely about number crunching.
This is perfect for that. So we will put the new one in, and we will try to keep as much as we can of the old one, so that we "sweat our assets". That also means we've got additional capacity for our researchers to use. Then we will go for a gradual replacement: as new processors come online and new research projects come in, the balance of the jobs will change, so we may have to strip out a particular type of node and replace it with a different type of node. That will be far more organic in the future; we're not expecting to do a complete rip and shred, unless something comes up and we build a new data center, but that's not on the cards at the moment. The equipment itself is fairly generic: these are standard blade enclosures. The storage is standard storage; we have about two hundred and forty terabytes in this block here, and it's all connected up by InfiniBand.

SEAN>> Is InfiniBand a speed of network?

It's a standard. This is 40 gigabit InfiniBand.

SEAN>> So at home you might have gigabit; this is 40 of those?

Yes, 40 gigabit, and of course it's also multipath, because there's no point in doing a lot of calculations if you can't then get the results of those calculations off. There are effectively two types of jobs: parallel jobs, where a job runs on multiple nodes, and single-node jobs, where it all runs on one node. With the parallel jobs you need network connectivity to make sure you're not processing the same bit twice.

SEAN>> So for a researcher, or someone who's part of a project, what's the big benefit of doing this rather than letting their office computer do it? Is it the speed of compute? Is it the fact that they can set it off and come back another day? What's the main benefit?

Yes, it's the capacity.
Because the job will start to run and then just continue to run. For example, Christmas is a very busy time for us, because a lot of researchers will start a job going, then come back after Christmas and pick up the data. As I say, you could do these things at home; it's just that it would take you months or years to do what this can do in days or hours.

SEAN>> Are they hot-swappable, then?

Yes, they are.

SEAN>> (Joking) Come on then, let's pull one out...

No! They're all single-phase power, but because the phase on this rack is different from the phase on this rack, there is the possibility of a potential difference of more than 400 volts across the two racks. It's unlikely, because each of the... but from a health and safety point of view... It's exactly the same reason you'll see a lot of these have got laser [warning stickers], because we use laser optics.

SEAN>> For your networking?

Er, yes, the fibre...

SEAN>> And what is that, the aircon?

Nope, that is the fire suppression.

SEAN>> Oh, let's go and have a look at that, then.

The fire suppression system we have in here is an IG55 system, which is an inert gas: 50% argon, 50% nitrogen. If there is a fire in here, all of the gas is released in one go. That replaces about half the atmosphere in here, which takes the oxygen level down to a point where it doesn't support combustion. It is just about breathable, but you wouldn't want to run a marathon in it; it's like trying to run at the top of Mount Everest.

SEAN>> So it suppresses the fire without damaging the kit?

Yes. The gas is released through these nozzles here.

SEAN>> They look like sprinklers, but they're actually gas...

Gas nozzles, yes.

SEAN>> And how does it work with the cooling? Does it go in hot one side and out cold the other?
Yes. Basically we use aisle containment, so this is the cold aisle. We put cold air in, it goes through the equipment, and we'd expect to see a delta T of 20-odd degrees; on the other side it gets vented through...

SEAN>> So through that glass it's going to be 20 degrees warmer? Can we go in? Yeah, OK. I think I'd like to spend my time on this side...

If you come down here you can definitely feel the temperature difference. So these are compute nodes.

SEAN>> And how many computers are in each one of those blocks, then?

In this particular one you've got 1, 2, 3, 4... 8 individual blades in this blade enclosure here. You asked about the big red button? That's the big red button.

SEAN>> That would turn it off and on?

No, that would turn it off.

SEAN>> Ah, that's like a "Danger, danger!" Press that?

If I press that, everything will die immediately.

SEAN>> Let's stay away from the big red button, then...

But that is the big red button, yes...
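Several figures in the interview can be sanity-checked with simple arithmetic: the 70 kW drawn continuously by one block, 40 gigabit InfiniBand versus home gigabit, a potential difference of over 400 V between racks on different phases, an IG55 release replacing about half the room air, and a delta T of 20-odd degrees across the equipment. The Python sketch below is a rough illustration only. The function names, the hypothetical 1 TB dataset, the 230 V per-phase supply, and the air density and specific heat values are assumptions rather than facility data, and every calculation ignores real-world losses.

```python
import math

# Back-of-the-envelope checks of figures mentioned in the interview.
# Assumptions (not from the facility): constant load all year, an ideal
# link with no encoding or protocol overhead, a UK nominal 230 V supply,
# and typical room-temperature air properties.

HOURS_PER_YEAR = 24 * 365


def annual_energy_kwh(load_kw: float) -> float:
    """Energy used by a load drawn continuously for a (non-leap) year."""
    return load_kw * HOURS_PER_YEAR


def transfer_seconds(size_bytes: float, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbps gigabits/s."""
    return size_bytes * 8 / (link_gbps * 1e9)


def line_to_line_volts(phase_volts: float) -> float:
    """Voltage between two phases of a balanced three-phase supply."""
    return phase_volts * math.sqrt(3)


def diluted_oxygen_percent(fraction_replaced: float,
                           ambient_pct: float = 20.9) -> float:
    """Oxygen level if inert gas displaces a fraction of the room air."""
    return ambient_pct * (1 - fraction_replaced)


def airflow_m3_per_s(heat_w: float, delta_t_k: float,
                     density_kg_m3: float = 1.2,
                     cp_j_per_kg_k: float = 1005.0) -> float:
    """Air volume flow needed to carry heat_w away with a delta_t_k rise."""
    return heat_w / (density_kg_m3 * cp_j_per_kg_k * delta_t_k)


if __name__ == "__main__":
    print(f"70 kW all year: {annual_energy_kwh(70):,.0f} kWh")
    one_tb = 1e12  # hypothetical 1 TB result set
    print(f"1 TB over 1 Gb/s:  {transfer_seconds(one_tb, 1):,.0f} s")
    print(f"1 TB over 40 Gb/s: {transfer_seconds(one_tb, 40):,.0f} s")
    print(f"Phase-to-phase at 230 V: {line_to_line_volts(230):.0f} V")
    print(f"Phase-to-phase at 240 V: {line_to_line_volts(240):.0f} V")
    print(f"O2 after replacing half the air: {diluted_oxygen_percent(0.5):.1f} %")
    # Assumes all 70 kW ends up as heat in the air stream.
    print(f"Airflow for 70 kW at a 20 K rise: "
          f"{airflow_m3_per_s(70_000, 20):.1f} m^3/s")
```

For example, 70 kW drawn around the clock comes to 613,200 kWh a year, and moving 1 TB ideally takes 200 s at 40 Gb/s versus 8,000 s at 1 Gb/s. In practice 40 Gb/s QDR InfiniBand is a signalling rate, and 8b/10b encoding leaves about 32 Gb/s for data. Two phases at 230 V each give roughly 398 V between them, and a supply running at 240 V gives about 416 V, consistent with the "more than 400 volts" quoted. Halving the 20.9% oxygen in normal air leaves about 10%.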
High Performance Computing (HPC) - Computerphile, posted by 林宜悉 on 2020/03/27