What are the most difficult concepts, tasks and terms for new supercomputer users to understand?
Threaded debugging is difficult. The hard part is finding the location where a lock has not been taken on a global piece of data while multiple tasks are accessing it, one reading and the other writing. The thread checkers I have tried (Intel’s tcheck and inspect) help quite a bit, but they slow down the running application by 100x, so they may not be usable on a real problem. One then has to scale down the problem, while still reproducing the error, to use the tool. Totalview does help with debugging threaded code, but again, it takes time.
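To make the missing-lock problem concrete, here is a minimal sketch in Python rather than in the C or Fortran most HPC codes use. It is a simplified variant of the case described above: several threads update one shared global, and a lock guards every update. The names `counter`, `counter_lock`, `add_many` and `run` are made up for illustration.

```python
import threading

counter = 0                      # shared global data (hypothetical example)
counter_lock = threading.Lock()  # guards every access to counter


def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        # Without the lock, the read-modify-write below can interleave
        # between threads and updates can be lost.
        with counter_lock:
            counter += 1


def run(num_threads=4, n=50_000):
    """Reset the counter, run the workers to completion, and return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place, `run(4, 50_000)` always returns 200000. Removing the `with counter_lock:` line gives the kind of bug described above: the result may be wrong only occasionally, which is what makes it hard to find.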
Another difficult task is finding enough work for speedup to continue in strong scaling problems. With 500+ core machines on the horizon, this will become more difficult and more important. It’s difficult to get 16x speedup on a 16 core chip … getting 500x speedup on a fixed size problem is not trivial.
I think MPI in general is not very easy to comprehend.
I’ll throw in scaling to large numbers of processors as another concept which can be difficult. While it is easy to scale to 1000 processors, scaling to 50,000 or 100,000 and more is a whole new ball game.
As an example, suppose you have a data structure which holds some information about how the application is distributed across the machine, perhaps so that each MPI process knows where to send or receive messages. With 1000 processors, the memory for this is some small part of your memory footprint. But when the application is scaled up to 50,000 processors, this memory can become significant, and at, say, 100,000 processors it may prevent the application from even fitting on the machine. Thus the data layout, and the algorithms which work on the data, need to be reworked, with scaling as a criterion in the design of both.
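A rough back-of-the-envelope sketch of that growth, assuming every process keeps one table entry per process in the job. The 64 bytes per entry is a made-up figure; real entries that carry per-peer buffers could be far larger.

```python
def table_bytes_per_process(num_procs, bytes_per_entry=64):
    """Memory one process spends on a table with one entry per process."""
    return num_procs * bytes_per_entry


def table_bytes_total(num_procs, bytes_per_entry=64):
    """Memory the whole job spends when every process holds the full table."""
    return num_procs * table_bytes_per_process(num_procs, bytes_per_entry)


# bytes_per_entry=64 is an assumed value, used only to show the trend.
for p in (1_000, 50_000, 100_000):
    per_proc = table_bytes_per_process(p)
    total = table_bytes_total(p)
    print(f"{p:>7} processes: {per_proc / 1e6:6.2f} MB each, {total / 1e9:8.2f} GB in total")
```

The per-process cost grows linearly with the process count and the job-wide cost grows quadratically, so going from 1000 to 100,000 processes multiplies the total by 10,000. That is why a layout that is harmless at small scale has to be redesigned, for example by storing only the neighbours a process actually talks to.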
Regarding how to manage MPI and inter-core information, please note Dr. David Ungar, Many Core processors: Everything You Know (about Parallel Programming) Is Wrong! http://bitly.com/ymP56s. The difficulty pointed out by L.S. is a real one. Let me quote from the blog:
“In our Renaissance project at IBM, Brussels, and Portland State, we are investigating what we call “anti-lock,” “race-and-repair,” or “end-to-end nondeterministic” computing. As part of this effort, we have build a Smalltalk system that runs on the 64-core Tilera chip, and have experimented with dynamic languages atop this system. When we give up synchronization, we of necessity give up determinism. There seems to be a fundamental tradeoff between determinism and performance, just as there once seemed to be a tradeoff between static checking and performance.”
In other words, there is more than one answer. In applications like crash simulations, we knew that all along. Note that the last comment on the blog has the link to the original slides from Dr. Ungar. Even more interesting is the video interview with Dr. Ungar.
Quoting from it, in reference to a 1000 core system:
“If we cannot skirt Amdahl’s Law, the last 900 cores will do us no good whatsoever. What does this mean? We cannot afford even tiny amounts of serialization.”
The gist of this, for users of supercomputing applications, is that a fixed size problem, which takes say 1000 hours to run in serial mode, will not run in 1 hour on 1000 processors, due to Amdahl’s law.
Serial sections, barriers, synchronizations, communication between processes and I/O all slow down the application.
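Amdahl's law puts a number on this. A small sketch, where `serial_fraction` is the share of the single-processor run time that cannot be parallelized; the 1% used in the note below is an assumed figure, not a measurement of any real code.

```python
def amdahl_speedup(serial_fraction, num_procs):
    """Upper bound on speedup for a fixed-size problem (Amdahl's law)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_procs)


def parallel_hours(serial_hours, serial_fraction, num_procs):
    """Best-case wall-clock time on num_procs processors."""
    return serial_hours / amdahl_speedup(serial_fraction, num_procs)
```

If only 1% of the work is serial, `amdahl_speedup(0.01, 1000)` is about 91, so the 1000-hour job takes roughly 11 hours on 1000 processors, not 1 hour. No number of processors can push the speedup past 100 in that case.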
For many users the payoff isn’t that one can run a problem 1000 times quicker, but that they may be able to run a problem which is 1000 times bigger.
I don’t think this is necessarily a difficult concept to explain to users, but it is important. They need to understand weak-scaling vs. strong-scaling.
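One way to show the difference to a new user is to put the two textbook models side by side: Amdahl's law for strong scaling (fixed problem size) and Gustafson's law for weak scaling (problem size grows with the processor count). This is only a sketch. The function names are made up, and the serial fraction means slightly different things in the two models: a share of the serial run time in the first, a share of the parallel run time in the second.

```python
def strong_scaling_speedup(serial_fraction, num_procs):
    """Fixed total problem size: Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)


def weak_scaling_speedup(serial_fraction, num_procs):
    """Problem size grows with the processor count: Gustafson's law."""
    return serial_fraction + (1.0 - serial_fraction) * num_procs


# serial_fraction=0.01 is an assumed value chosen for illustration.
for p in (16, 500, 1000):
    strong = strong_scaling_speedup(0.01, p)
    weak = weak_scaling_speedup(0.01, p)
    print(f"{p:>5} procs: strong {strong:7.1f}x, weak (scaled) {weak:7.1f}x")
```

With the same assumed 1% serial share on 1000 processors, the fixed-size problem tops out near 91x while the scaled problem does about 990 times the work in the same wall-clock time.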
I would also add the understanding that when other users on the cluster are using shared resources such as storage, their jobs can affect the results of your job.
Amdahl’s law isn’t always an impediment. The key issue is problem size.
Codes already exist in the scientific (super)computing community that scale to over 100,000 cores. You need really huge problems, but they do exist. If your goal is to make Angry Birds run 1000 times faster, you’re probably out of luck.
I think one of the biggest impediments to new users is understanding performance
portability issues. Code that runs very fast on one system may not be very fast on
another. This is becoming worse with the diversity of recent hardware (GPUs, many-
The sequential part of a program (Amdahl’s “limiting factor”) can often be broken into parallel tasks, reducing the overall execution time. Almost all speedup potential depends on the ability to perform both data parallelization and task parallelization. If there is a single part of the code that will not succumb to either then Amdahl’s limit will prevent further speedup — time for a faster processor or a new algorithm.
As Donald pointed out above, often accessing the file system is one of those limiting sequential activities that must be completed before the program can continue, especially in initialization and restart loading. How do you efficiently load initial data and programs into a million processors?