As I understand it, to achieve true concurrency on a single computer, you need to ensure that app code sits in one of the core-private caches (L1, L2, but not L3). Thoughts?
2025-01-16
that sounds like an untrue scotsman to me, but maybe you can clarify how you're defining "true concurrency"
...the ability to execute more than one program or task simultaneously...
In a non-cache CPU, it never happens that tasks run at the same time. They only appear to do so because context-switching happens so fast as to fool us humans. There's only 1 CPU and it gets time-shared between threads.
In multi-core CPUs on the same chip, the distinction becomes blurred. If the code runs entirely in the private cache, with no cache misses, the code runs (truly) concurrently. Programs can run "at the same time", limited by the number of cores available. But, the cores block and wait if the code needs to access shared cache or shared main memory. The hardware determines the synchronization. The software code gets no say in this - it just tries to access memory and may get blocked by the hardware.
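A rough way to see that hardware arbitration in action (a Go sketch of my own, assuming a typical 64-byte cache line; nothing here is specific to any particular chip): two goroutines bump two counters that happen to share a cache line, then bump two counters padded onto separate lines. The source code never mentions any synchronization between the goroutines, yet the shared-line version usually runs noticeably slower because the line ping-pongs between cores.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Two counters that will usually land on the same cache line.
type packed struct {
	a, b int64
}

// The same two counters, padded so each one sits on its own cache line
// (64 bytes is assumed here; the real line size is hardware-dependent).
type padded struct {
	a int64
	_ [56]byte
	b int64
	_ [56]byte
}

// bump increments *x and *y from two goroutines and reports the elapsed time.
func bump(n int, x, y *int64) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddInt64(x, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddInt64(y, 1)
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 20_000_000
	var p packed
	var q padded
	fmt.Println("same cache line:     ", bump(n, &p.a, &p.b))
	fmt.Println("separate cache lines:", bump(n, &q.a, &q.b))
}
```

The point isn't the exact numbers; it's that which version blocks, and for how long, is decided by the cache hardware rather than by anything visible in the source.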
In my mind, L1/L2/L3 caching is just a kludge driven by 1950s desires to share memory on time-shared, single-threaded machines. These days, using bowls full of Arduinos would be a smarter choice if one wanted simple (true) concurrency. It seems to me that the software world is being presented with more and more complicated syntaxes for gluing asynchronicity on top of synchronous languages which run on top of already-asynchronous electronics.
FTR: I don't consider processes that take themselves out of the picture while waiting for async I/O to be "running" (synchronously or concurrently). Sitting in memory whilst executing no instructions is not "running" in my vocabulary. In contrast, operating systems bestow the state named "Running" on any process which isn't "Blocked", yet which might not actually own CPU time and is not executing opcodes. In my book, any process which is (truly) "running" needs to be in charge of a CPU (or a core). Another way to put it: if you have 1,000 processes and fewer than 1,000 cores then the best you can do is to simulate true concurrency. That's the basis of "threads" in all programming languages that I know about - simulation of concurrency, not concurrency.
In my mind the fundamental problem is that, by using hardware to do low-level sync at the memory-access level, we take design decisions out of software architects' hands. Something like very explicit message-passing would be better (not the Smalltalk kind of "message passing" but the internet kind of message passing). I just don't like hidden, under-the-hood, "surprise" blocking where some other process can determine my process' run-time. Hidden blocking is OK if you're building "calculators", but, not so OK if you're building internet-y (sequencing) software where you want total control/expression of all latencies and running times.
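For concreteness, here's a minimal sketch of the "internet kind" of message passing I mean, in Go, with a loopback TCP connection standing in for the network (names and structure are purely illustrative): the two sides share no memory at all, and every interaction is an explicit send or an explicit receive, so any waiting happens exactly where the code says it does.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// serve answers each connection line-by-line. It shares no memory with its
// callers; the only coupling is the bytes that travel over the connection.
func serve(ln net.Listener) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			defer c.Close()
			sc := bufio.NewScanner(c)
			for sc.Scan() {
				fmt.Fprintf(c, "got %q\n", sc.Text())
			}
		}(conn)
	}
}

func main() {
	// Loopback TCP stands in for "the internet" here.
	ln, err := net.Listen("tcp", "127.0.0.1:0") // the OS picks a free port
	if err != nil {
		panic(err)
	}
	go serve(ln)

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	fmt.Fprintln(conn, "hello")                          // explicit send
	reply, err := bufio.NewReader(conn).ReadString('\n') // explicit receive: the wait is visible here
	if err != nil {
		panic(err)
	}
	fmt.Print(reply)
}
```

The payload is just bytes, so the same shape works whether the other end is a goroutine, another process, or another machine.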
In a non-cache CPU, it never happens that tasks run at the same time
that's not necessarily true, though, right? there's nothing that requires a cache to allow concurrency, and there's not a fundamental reason you couldn't, it's just that they both tend to be present together in modern cpus
In my mind the fundamental problem is that, by using hardware to do low-level sync at the memory-access level, we take design decisions out of software architects' hands
Wasn't this exactly the shift that approaches like itanium were trying for, and didn't get traction?
... nothing that requires a cache to allow concurrency ...
Correct, but concurrency - by definition - requires separate CPUs. Caching is just an attempt at decoupling cores without actually using distributed CPUs, whilst continuing to do what we've always been doing...
"Concurrency" is just a mis-use of a word from the English language. It would be more accurate to call it "time-sharing".
... and didn't get traction? ...
I think that MOP (message-oriented programming) is necessarily on the horizon, due to the shift in our problem space from 1950s single-threaded CPUs and building computation-based calculators to today's internet-y, robotics, IoT, etc. thrust.
I need to refresh my memory of what Itanium was attempting to do, but, I suspect that it tried to accommodate synchronous-language-think, which ain't the right way to approach asynchronous problems (and that would explain its failure; on top of which, it was probably plagued by the "if we asked people what they wanted, they'd have said faster horses" effect).
Caching is just an attempt at decoupling cores without actually using distributed CPUs, whilst continuing to do what we've always been doing...
I think we disagree there... I see that as a side-effect of caching, perhaps, but not at all fundamental to the need or implementation
"Concurrency" is just a mis-use of a word from the English language. It would be more accurate to call it "time-sharing".
I think you're redefining what you're calling "true" concurrency. Sure, you can keep drilling down deeper in the stack in search of your truth, but at what point do you stop? I don't choose to stop calling something concurrent just because there's also a need to arbitrate access to shared hardware resources sometimes.
I think there's a reason we've built these abstractions and choose to program on them, rather than try to count cycles and account for discrete hardware unit requirements at a software level
but concurrency - by definition - requires separate CPUs
And also don't agree with that on modern cpus, either... given pipelining has become pretty ubiquitous, each cpu/core is often also concurrently executing different stages of multiple instructions
(though again, there are limits to that imposed by the fundamental physical limits of the hardware, so again we let the hardware arbitrate that, and benefit from it when we can, but recognize that contention for resources can also cause pipeline stalls)
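A rough illustration of that in-core overlap (a Go sketch only; exact numbers will vary wildly by CPU): the same number of multiply-adds is done either as one long dependency chain or as two independent chains, and the two-chain version typically finishes in noticeably less wall time because the core can keep both chains in flight at once.

```go
package main

import (
	"fmt"
	"time"
)

// oneChain does n multiply-adds where every step depends on the previous one,
// so the core can't overlap them.
func oneChain(n int) uint64 {
	x := uint64(1)
	for i := 0; i < n; i++ {
		x = x*2862933555777941757 + 3037000493
	}
	return x
}

// twoChains does the same n multiply-adds in total, but as two independent
// chains that a pipelined/superscalar core can execute concurrently.
func twoChains(n int) uint64 {
	x, y := uint64(1), uint64(2)
	for i := 0; i < n/2; i++ {
		x = x*2862933555777941757 + 3037000493
		y = y*2862933555777941757 + 3037000493
	}
	return x ^ y
}

func main() {
	const n = 200_000_000
	for name, f := range map[string]func(int) uint64{
		"one chain ": oneChain,
		"two chains": twoChains,
	} {
		start := time.Now()
		r := f(n)
		fmt.Printf("%s: %x in %v\n", name, r, time.Since(start))
	}
}
```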
None of this fundamentally contradicts your conclusion that maybe we should be moving towards a message passing model, but I don't feel like this is a compelling argument for it in my opinion, either
The Cerebras wafer-scale processors have no shared access to memory (no cache). Each core can only access its local memory. Planning the message routes between individual cores becomes one of the challenging parts of programming it.
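I haven't used the Cerebras toolchain, so this is only a toy Go sketch of that programming model (channels stand in for the on-wafer links, and the 1-D chain topology is made up): each "core" sees only its own state plus links to its neighbours, so getting a value from core 0 to core 3 means the program itself forwards it hop by hop.

```go
package main

import "fmt"

// A message that has to be routed, hop by hop, to its destination core.
type msg struct {
	dest    int
	payload string
}

func main() {
	const cores = 4

	// Each "core" owns only its own inbox; no memory is shared between them.
	inbox := make([]chan msg, cores)
	for i := range inbox {
		inbox[i] = make(chan msg, 1)
	}
	done := make(chan string)

	for id := 0; id < cores; id++ {
		go func(id int) {
			for m := range inbox[id] {
				if m.dest == id {
					done <- fmt.Sprintf("core %d received %q", id, m.payload)
					continue
				}
				// Not ours: the program, not the hardware, chooses the route.
				// In this toy chain the only route is "pass it to the right".
				inbox[id+1] <- m
			}
		}(id)
	}

	// Inject a message at core 0 addressed to core 3; it travels 0 -> 1 -> 2 -> 3.
	inbox[0] <- msg{dest: 3, payload: "hello"}
	fmt.Println(<-done)
}
```

On hardware like that, deciding where data lives and how it moves is the program's problem, which is roughly the opposite of the cache model where the hardware hides it from you.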
yeah, I recall the early parallella manycore designs were similar... either the programmer, the tools, or some combination of both had to "understand" the implications of interconnects to work well. In the ideal case they could outperform "classic" models by a big margin, but in the more realistic case of "semi-skilled programmer throws code spaghetti at the naive port of gcc's backend" the results were way worse.
ia64 was a different approach; it was much less about moving away from shared memory models, but it was also an attempt to shift the "understand hardware implications" burden to the software side
and I suspect if the backing of the Intel/AMD rivalry wasn't sufficient funding to make that make sense... I'm not going to say it's not possible, but I'm also going to be very skeptical of any "quick fixes"