Supercomputing circa 2003

By Dr. Jack A. Shulman, rev .02

 

In the modern era, supercomputing has been successfully collapsed onto a single chip: the superscalar RISC chip. Several examples of the modern "supercomputer chip" (the Intel Itanium 2, the IBM RS6000 Power, the MIPS and the Sun Sparc) have made very successful debuts as standalone server platforms.  In IBM's case, such chips have evolved into actual supercomputers, although the pen is mightier than the sword at IBM: the actual implementations achieve varying degrees of success, chiefly a failure to harness all of the available horsepower.  Until now, no one had considered how to automate the "supercomputer speedup process" in software so that all of that performance could actually be used; that is, how to adopt an architecture suited to the real demands software places on the average supercomputer (is that a malapropism? What supercomputer is average?).

 

American Computer diverged from the mainstream in creating its "Advanced HyperSystems" project in 1993.  After I left IBM's contracting base, its owners asked me to come in and progressively blackboard new technology that postdates anything offered today by the mainstream, including IBM, until an acceptable design for a new kind of supercomputer was accomplished.

 

And so, for nine years, we designed, tested and built, for proof purposes, a number of supercomputers whose capabilities we did little to publicize; in fact we were an audience of one, for to use the marketing process that IBM uses would require an army of patent attorneys, a luxury only IBM has, one that ensures its size but also ensures its internally competitive nature: patenting technologies of which you are NOT the original author creates an enforcement nightmare.  This may be deemed a "narcissistic" view, that is, no sharing of technology, but we wanted an answer to the problem, rather than to permeate the world with never-ending expansions of technology moving forward in baby steps, as IBM has with its Deep Thought, Deep Blue, Pacific Blue and Ascii Blue exercises.

 

By late 1994, we had determined that two courses would be pursued:

 

A)  Type A Massively Parallel Supercomputer.  CONCENTRATION OF PROCESSOR COUNT DENSITY AND BUS SPEED.
 

A massively parallel supercomputer is just as its name suggests: a massive aggregation of many smaller CPUs, harnessed in a manner intended to provide as fast communication and arbitration between them as possible.  The key is arbitration, since you can build fast communications, but arbitrating many users to a single resource can turn a race to produce a "hare" into the proverbial tortoise.  We had learned from several mistakes IBM made in attempting to copy some of my earliest work from the 1980's on Proteus and Aerosphere: they simply do not understand granularity, and their idea of wide-band arbitration is mass brute-force overkill.


And so we decided to focus on what would be a good granule size.  Our solution probably vexes IBM, because we decided that a single granule was most efficient if it had three processors.  However, a tri-processor design lacks redundancy, leading to chaos when heavily loaded.  So we altered our thinking and fixed on a four-processor, fixed-memory-size granule, with virtual memory (in a traditional sense) and a unique method of resolving arbitration (which, being proprietary, I won't share here) in terms of resource sharing, resource exclusion and resource synchronization.
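
The specific arbitration method we devised is proprietary and not described here.  Purely as a generic illustration of the problem a four-processor granule must solve, the following Python sketch shows a conventional ticket-style arbiter mediating sharing, exclusion and synchronization of one resource among four processors; the class and method names are hypothetical and are not drawn from our design.

    from collections import deque

    class GranuleArbiter:
        """Generic illustration: serializes four processors' claims on one shared resource."""
        def __init__(self, n_cpus=4):
            self.waiting = deque()   # FIFO of CPUs requesting the resource
            self.holder = None       # CPU currently granted exclusive access
            self.n_cpus = n_cpus

        def request(self, cpu_id):
            if cpu_id != self.holder and cpu_id not in self.waiting:
                self.waiting.append(cpu_id)

        def grant(self):
            """Advance arbitration one step: grant the resource if it is free."""
            if self.holder is None and self.waiting:
                self.holder = self.waiting.popleft()
            return self.holder

        def release(self, cpu_id):
            if self.holder == cpu_id:
                self.holder = None

    # Example: CPUs 0..3 all contend; grants proceed in arrival order.
    arb = GranuleArbiter()
    for cpu in range(4):
        arb.request(cpu)
    print(arb.grant())   # -> 0
    arb.release(0)
    print(arb.grant())   # -> 1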

This course required us to identify the processor mask for the "granule" we intended to use to build the ultimate "massively parallel" supercomputer, which in our case resulted in selection of the INTEL IA-64/128 PA-RISC superscalar processor with 2M of cache or larger, along with 1 Gigabyte of dedicated 333-400 MHz double-data-rate ECC memory using dual channels and a 64-bit-wide data bus for transfers and addressing.


A single granule today consists of a large board implementation of the following (sketched schematically after the manager list below):

x- Four ITANIUM-2 bare chips, 4M Cache each
y- 4 Gigabytes of shared dedicated DDRam
z- One HyperSystems Chipset including

    an East Bridge with four dedicated OC1000 Optical Busses
    a West Bridge with a Dedicated 133 MHz 256-Bit Wide Advanced Graphical Bus
    a North Bridge with SMP 1.9b Quad APIC, instruction trace sharing and hyperthreading support
    a South Bridge with PCI/x Bridge and PCI Bridge for locally attached controllers

And a manager/supermanager for four granules consists of:

a- 16 OC1000 Optical Busses
b- One Granule
c- Optical Twisted Double Helix Net Connectors (2x) for connection to up to three other Managers and a supermanager.
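
To keep the hierarchy straight, the two bills of materials above can be restated as plain data.  The Python sketch below simply transcribes them as nested dictionaries; the field names are ours, for illustration only, and do not correspond to any actual configuration format used by the hardware.

    # Schematic restatement of the granule and manager composition described above.
    GRANULE = {
        "cpus": [{"chip": "Itanium 2", "cache_mb": 4} for _ in range(4)],
        "shared_ddr_gb": 4,
        "hypersystems_chipset": {
            "east_bridge": {"optical_busses": 4, "type": "OC1000"},
            "west_bridge": {"graphics_bus": "133 MHz, 256-bit wide"},
            "north_bridge": {"smp": "1.9b Quad APIC",
                             "features": ["instruction trace sharing", "hyperthreading"]},
            "south_bridge": {"bridges": ["PCI/x", "PCI"]},
        },
    }

    MANAGER = {
        "optical_busses": 16,        # OC1000
        "granules": [GRANULE],       # one granule per manager
        "tdh_net_connectors": 2,     # links to up to three other managers and a supermanager
    }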

 

Replicating this pattern into a geodesic-like array of 4x16x64x256 granules makes the basic 256-processor Supercomputer a reality (with 1108 CPU chips inside).  The 84 manager and supermanager granules more than compensate the system, allowing all 256 processors to run at 86% of full speed or higher.  And by adding a Superframe Manager of 4x16 processors (64 more), we can support up to 16 such geodesic-like arrays in a single system, without additional management, making 4096-CPU-Granule Supercomputers a reality.

Naturally, since we use an internal limited-overhead, minimum-distance design (meaning: minimum conflicts and minimum arbitration multiplexing overhead), we can continue to expand the number of processors.  Estimating the net performance of a single CPU Granule at somewhere in the vicinity of 2 Billion instructions per second on average (with a peak speed of 3.6 Billion instructions per second in a burst), the coaxial speed of the 4096-Granule unit is thought to be 8000+ Billion instructions per second on average and 14400+ Billion instructions per second during peak bursts.

The interesting advance we made in the design was to re-equip each granule with a Pre-Fetch / Post-Op cache of an additional 256 Mbytes of bipolar memory, which caches the elements most frequently needed to reload the processor's cache and uses each processor's highest-speed transfer mechanism to reload its cache in the event of cache penalty states, resulting in the CPUs running at 99% of their rated SPEED at all times.  That, coupled with the highly efficient TDH MDNet architecture, allows us to retain 99% of the computation capacity of each CPU.  This compares very favorably to the 44% (and declining) rate of CPU execution retained by IBM's current top-of-the-line supercomputer, allowing us to achieve more than a 4:1.6 higher performance ratio per unit of central processor speed, leading to a system about 2.5 times more powerful, for the price, than IBM is capable of producing.
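
The aggregate figures above follow directly from the per-granule estimate.  The short Python check below simply multiplies the stated per-granule rates by the granule count; it is arithmetic only, not a performance model.

    # Per-granule instruction rates stated above (billions of instructions per second).
    avg_gips_per_granule = 2.0
    peak_gips_per_granule = 3.6
    granules = 4096

    print(granules * avg_gips_per_granule)    # 8192.0  -> "8000+ Billion" average
    print(granules * peak_gips_per_granule)   # 14745.6 -> "14400+ Billion" peak burst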

However, our research also produced a startling discovery: we could achieve ALL THAT PERFORMANCE with a single CPU chip, if only we built one using something other than silicon dioxide and conventional transistors.  This led to the consideration of a second architecture, the 'Type B'.

 

B)  TYPE B Linear Hyper-Accelerated Supercomputer.  REPLACEMENT OF THE MASSIVELY PARALLEL DESIGN WITH HYPER-ACCELERATION SEMICONDUCTORS (nAST)

 

After a while it became apparent that the complexity of larger and larger arrays of CPUs in a Supercomputer (in the case of the design above, 1360 Itanium 2 CPUs are used…) causes an exponential expansion in the complexity of the software needed to service the system, so we decided to investigate an early experiment in physics: the Near Insulator Near Conductor Bimetallic Electron Trap I had demonstrated to the physics community in the mid 1980's.

 

The advantage of using a device tottering on the brink of conductivity and insulation is that such devices can be made to settle in one state or the other using very little quantum energy; in such experimental devices, for example, the mere presence of a trapped electron can tilt the semiconductor scales in one direction or the other.

So, in 1994, I elected to start experimentation on building a 16-Boolean-function chip suitable for ALU use on single-bit, nybble, byte and word streams of data.  As luck would have it, the first successful example, the nAST Oscillator, was ready by the spring of 1996.  By mapping its transient states to "and", "or", "nand", "nor" and other Boolean transfers, using a small array of such devices with dual photo-electronic inputs and an LC diode as a semaphore output, we were able to build a successful 32-bit-wide 16-Bool chip that could process two data streams in comparative mode and produce a result at the fastest rate optics is capable of transmitting data, somewhere in the 500 Terahertz range.  That meant such an ALU would be capable of register-register operations at about 500,000 Giga-instruction-cycles per second if the full streaming rate were sustainable.
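
Two simple facts underpin this figure: there are exactly sixteen Boolean functions of two inputs, and an optical carrier in the 500 Terahertz range corresponds to roughly 500,000 giga-cycles per second.  The Python sketch below enumerates the sixteen functions as truth tables and restates the rate arithmetic; it says nothing about how the nAST device itself realizes those functions.

    # The 16 possible Boolean functions of two inputs, enumerated as 4-bit truth tables
    # over the input pairs (a, b) = (0,0), (0,1), (1,0), (1,1).
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    functions = {}
    for code in range(16):                       # each code selects one output column
        table = [(code >> i) & 1 for i in range(4)]
        functions[code] = dict(zip(inputs, table))

    assert len(functions) == 16                  # AND, OR, NAND, NOR, XOR, ... all appear

    # Rate arithmetic: 500 THz = 500,000 GHz; one register-register operation per
    # optical cycle gives ~500,000 giga-instruction-cycles per second.
    optical_rate_hz = 500e12
    print(optical_rate_hz / 1e9)                 # 500000.0 giga-cycles per second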

Because such a supercomputer would be processing so much data in so short a period of time, I invented a new architecture, the "instruction execution streaming processor" (IESP).

 

The idea behind the IESP is that programs consist of compiled streams of execution code and code loops that are packeted with tags associating them with their Tasks, then "streamed" in long data transfer queues into the Supercomputer, with their results combined according to any interdependencies, gated by dependency state switches, allowing an array to interoperate without reordering for the purposes of state dependency.

The result: an array of 16 of the Boolean function chips that could accept a page of up to 1,000,000 instructions, process them, and move to the next page, with all memory and cross-instruction references resolved, in a single "multi-clock" cycle, allowing up to 500 million such pages to be streamed at this array per second.
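
A minimal sketch of the streaming idea follows, using hypothetical names of our own choosing: instruction pages carry a task tag and a set of dependency switches, and a page is released to the array only when its dependencies have been satisfied.  The throughput check at the end restates the arithmetic above.  None of this is the production IESP machinery; it only illustrates dependency-gated page streaming.

    from dataclasses import dataclass, field

    @dataclass
    class InstructionPage:                 # hypothetical representation of one IESP page
        task_tag: str
        instructions: list                 # up to 1,000,000 instructions per page
        depends_on: set = field(default_factory=set)   # tags that must complete first

    def stream_pages(pages):
        """Release pages whose dependency switches are satisfied; results of completed
        tasks gate the pages that depend on them, with no reordering inside a page."""
        completed, emitted, pending = set(), [], list(pages)
        while pending:
            ready = [p for p in pending if p.depends_on <= completed]
            if not ready:
                raise RuntimeError("circular dependency among tasks")
            for p in ready:
                emitted.append(p)          # page streamed to the Boolean-function array
                completed.add(p.task_tag)
                pending.remove(p)
        return emitted

    pages = [InstructionPage("B", ["..."], {"A"}), InstructionPage("A", ["..."])]
    print([p.task_tag for p in stream_pages(pages)])   # -> ['A', 'B']

    # Throughput arithmetic from the text: 1,000,000 instructions per page at
    # 500 million pages per second.
    print(1_000_000 * 500_000_000)         # 5e14 = 500,000 billion instructions/second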

The astonishing result: a single small processor of only 16 mid-scale-complexity chips and a large input array buffer, all fabricated using nAlkane Silver Thiozole, that could execute 500,000 Billion instructions in a single second, so long as we could sustain the fetch cycle for the programming.

In tests, we have since added a 43-chip Boolean array that processes 128- and 256-bit-wide complex arithmetic (the original 16-chip array was more than capable of simulating integer arithmetic without major modification).  This "floating point arithmetic" simulation processor also achieves a nearly sustained 500,000 Billion instruction per second rate during burst program transmission operations.

 

These two comprise the most complex elements of the Type B Linear Supercomputer.

We have yet to design the input program fetch, page faulting, input/output management and communications processors needed to support this CPU.

However, one thing of note is that the devices require an input of only 0.25 volts and consume less than 1 microwatt of power at all times.  The device produces little heat, and only on the photo-optic side.  The conclusion of the nAST Supercomputer experiment suggests that we are but a few million dollars away from a breakthrough: a Supercomputer of a few hundred mid-complexity nAST gate arrays, which could easily be reduced to a single chip, capable of a ½ Million Billion instructions per second sustained execution rate, so long as the programming was capable of delivering that much work to it.

For a single-CPU Linear Supercomputer, this breakthrough is very dramatic, because it eliminates the need to use three quarters of a million or more 1 GHz Itanium 2 processors just to achieve the same performance.

The cost reduction that the Type B Linear Supercomputer represents is equally dramatic, along with the power reduction.

 

At the time of completing the initial nAST project, we had come to conclude that the best direction for research dollars was to further commercialize nAST semiconductors.  With the prototype test of the Type B Linear Supercomputer, we proved not only that such a machine was thousands of times less expensive, but that it could yield a SINGLE CPU capable of outgunning the fastest CMOS silicon transistorized CPUs arranged in a Massively Parallel Array.

And, clearly, once miniaturized and commercialized, the Type B Linear Supercomputer CPU would be able to replace today's CPU chip as a single-CPU system, or even be incorporated into some futuristic massively or mid-scale parallel computer that used multiple nAST SUPERCOMPUTER CPUs.

As of this time, we had not standardized on a logical instruction set for the Type B, but were considering incorporating simulations of a wide variety of existing CPU instruction sets, including x86, Risc86, IA64, Sparc and Itanium 2, so as to provide the best overall housing for future software developers wishing to avail themselves of these codes without the limitations forced on them by the design of their CPU chips.  Such support would, of course, require expansion of the architecture to serve multiple arrangements of registers, but given the extraordinary speed of the Type B, we do not feel that would be a problem.
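
Purely as an illustration of how several guest instruction sets might share one Type B back end, the Python sketch below dispatches fetched words to per-ISA decoders before they are packeted into IESP pages; the decoder names and the register-count table are our own illustrative assumptions, not a committed design.

    # Hypothetical front-end dispatch: one decoder per simulated instruction set.
    def decode_x86(word):    return ("x86", word)
    def decode_ia64(word):   return ("IA-64", word)
    def decode_sparc(word):  return ("SPARC", word)

    DECODERS = {"x86": decode_x86, "IA-64": decode_ia64, "SPARC": decode_sparc}

    def decode_stream(isa, words):
        """Translate a guest instruction stream into a neutral form for page packeting."""
        decode = DECODERS[isa]
        return [decode(w) for w in words]

    # Each guest ISA sees its own register arrangement layered over the Type B registers.
    register_files = {"x86": 8, "IA-64": 128, "SPARC": 32}   # architectural register counts

    print(decode_stream("SPARC", [0x01, 0x02]))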

 

C) TYPE C STORAGE SUPERCOMPUTER.     

We also evolved the notion of a "Storage Supercomputer" which, using basic nAST dual-triodes, implemented terabytes of storage in a single-board design, so that disk drives need be used only as hard backup storage devices.

 

The exceptional speed (1 Million Gigabytes/second) such a Type C SC would provide would allow greater overall streaming rates to the central processor of the Type B Linear Supercomputer, eliminating one of the great foibles of modern computation: the great overhead of the mechanical disk drive as an extension to Random Access Memory.

One of the primary concerns of the Type C was to equip the unit with File Service Acceleration, eliminating the need for repetitive storage retrieval and update by attaching the File System to other Supercomputers while keeping it, and its service units, local to the Type C.  This "channelized Storage egress" greatly reduces overhead in the main supercomputer connected to the SC, by retrieving only the actual record needed and by providing all the file system maintenance and overhead the main supercomputer requires.  That more than quintupled the performance of our test vehicles, so long as we did not under-equip the design of the API between the Type C and the Type A, nor saddle the unit with poorly conceived drive topologies; for the latter we invented a UNIVERSAL HARD DRIVE STORAGE TOPOLOGY (UHD/ST) that could accommodate any and all other file systems as mere "conveniences" within the Type C, for application compatibility.
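
To make the "channelized Storage egress" idea concrete, here is a minimal Python sketch assuming an invented record-level interface: the main supercomputer asks the Type C for a single record by key, and all directory lookup and file-system maintenance stay on the Type C side, so only the record itself crosses the channel.  The class and method names are illustrative, not the actual UHD/ST interface.

    class TypeCStorage:
        """Holds the file system and its service units locally; serves records on request."""
        def __init__(self):
            self._files = {}                       # file name -> {key: record}

        def update(self, name, key, record):
            # All maintenance (allocation, indexing, consistency) is handled here,
            # on the Type C side, rather than in the main supercomputer.
            self._files.setdefault(name, {})[key] = record

        def fetch_record(self, name, key):
            # Channelized egress: only the requested record is returned, not raw blocks.
            return self._files.get(name, {}).get(key)

    # The attached Type A/B supercomputer asks for exactly the record it needs.
    store = TypeCStorage()
    store.update("customers.dat", 42, {"name": "example", "balance": 10_000})
    print(store.fetch_record("customers.dat", 42))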

 

IN CONCLUSION

 

In our opinion, semi-exotic approaches like nAST and the Type B Linear Hyper-Accelerated Supercomputer are not only feasible, but a better use of money than continuing on with current, slow CMOS processors in a Massively Parallel Array.

If emulating human intelligence ever becomes a necessity, say, in simulacrum robotics, a massive array of Type B Linear HA Supercomputers can be assembled into a single “brain box” and applied to that end application.

 

In the meanwhile, the Type B satisfies every requirement there is.       

Used as a workstation, server or supercomputer, or even in a scaled-down version for handhelds, it would bring an entirely new dimension of computation to business and industry, accelerating it nearly a million-fold.
_____

          

 

© 2002/2003  American Computer Science Association Inc.  All rights reserved.