Gonna say this one time.
I don’t post this stuff because I am looking for someone to tell me there’s not a scholarly article, or a deeper dive. I post it because of trends I have seen in the reporting of day to day events, and emerging threats.
I didn’t get here on scholarly articles on emerging threats
As this one gets closer to being truly weaponized… You need to know that SPECTRE and Meltdown cannot be patched.
I mean, I hate it too… all my life has been focused on one version of x86 or another.
But SPECTRE is not a ghost. It is real. It can do damage.
@thegibson omg Intel's response, "there's no problem here, as long as every developer compiles their code to assembly to find and correct timing vulnerabilities" 😂
@datatitian @thegibson Compiles their code to assembly? Eehh...we're talking about timing side channel attacks here. Compiling to bytecode gives the target at least the chance to winnow and chaff by randomizing the time each bytecode takes to execute. Compiling to raw assembly is probably the worst thing you can do under these stresses.
@vertigo @thegibson I wasn't being facetious, Intel literally published a guide that instructs you to compile to assembly and edit it to avoid these attacks: https://software.intel.com/security-software-guidance/secure-coding/guidelines-mitigating-timing-side-channels-against-cryptographic-implementations
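For anyone following along who hasn't met a timing side channel before, here is a minimal sketch of the kind of thing guides like that one worry about. It is not taken from Intel's document; the function names `leaky_equal` and `fixed_work_equal` and the token value are made up for illustration. Pure Python gives no real constant-time guarantee, so treat the second function as a picture of the idea rather than something to ship; `hmac.compare_digest` from the standard library is the call meant for real use.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: how long it runs depends on the data."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner the earlier the first mismatch is
    return True


def fixed_work_equal(a: bytes, b: bytes) -> bool:
    """Touches every byte no matter where (or whether) the inputs differ."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # no data-dependent branch inside the loop
    return diff == 0


# Hypothetical secret, for illustration only.
secret = b"s3cr3t-token"
print(leaky_equal(secret, b"s3cr3t-tokeX"))
print(fixed_work_equal(secret, b"s3cr3t-tokeX"))
print(hmac.compare_digest(secret, b"s3cr3t-token"))
```

Both functions return the same answers; the difference is only in how much an observer with a stopwatch could learn from the first one.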
@thegibson we have the tools to solve these problems, but I’ve had little luck convincing anyone with the resources to get it done that it’s real.
These two are just the beginning, and as long as we rely on static logic we’ll have computers that can’t be fixed.
I wrote a (weirdly patriotic?) post about using FPGAs to solve this and many other systemic vulnerabilities our computers have, but I’m not sure how to push it forward.
The issue isn't static logic. The issue is divorcing instruction decoding from instruction set design to attain performance goals not originally built into the ISA.
It takes, for example, several clock cycles just to decode x86 instructions into a form that can then be readily executed. Several clocks to load the code cache. Several clocks to translate what's in the code cache into a pre-decoded form in the pre-decode cache. Several clocks to load a pre-decode line into the instruction registers (yes, plural) of the instruction fetch unit. A clock to pass that onto the first of (I think?) three instruction decode stages in the core. Three more clocks after that, you finally have a fully decoded instruction that the remainder of the pipelines (yes, plural) can potentially execute.
Of course, I say potentially because there's register renaming happening, there are delays caused by waiting for instruction execution units to become available in the first place, there's waiting for result buses to become uncontested, ...
The only reason all this abhorrent latency is obscured is that the CPU literally has hundreds of instructions in flight at any given time. Gone are the days when it was a technical achievement that the Pentium had 2 concurrently running instructions. Today, our CPUs have literally hundreds.
(Consider: a 7-pipe superscalar processor with 23 pipeline stages, assuming no other micro-architectural features to enhance performance, still offers 23*7=161 in-flight instructions, assuming you have some other means of keeping those pipes filled.)
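The back-of-the-envelope figure above is just pipes times stages. A tiny sketch of that arithmetic, with `in_flight_capacity` as a made-up helper name and the same simplifying assumption as the post (every stage of every pipe holds exactly one instruction):

```python
def in_flight_capacity(pipes: int, stages: int) -> int:
    """Upper bound on in-flight instructions when every pipeline slot is full."""
    return pipes * stages


# The example from the post: 7 pipes, 23 stages each.
print(in_flight_capacity(7, 23))  # 161
```

Real cores add reorder buffers and other queues on top of this, so the number is only a floor for intuition, not a model of any particular chip.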
This is why CPU vendors no longer put cycle counts next to their instructions. Instructions are pre-decoded into short programs, and it's those programs (strings of "micro-ops", hence micro-op caches, et al.) which are executed by the core on a more primitive level.
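To make "instructions are pre-decoded into short programs" concrete, here is a toy decoder. Everything in it is hypothetical: the mnemonics, the micro-op names, and the table itself are invented for illustration and do not reflect any vendor's actual micro-op encoding.

```python
from typing import Dict, List

# Hypothetical instruction-to-micro-op table, for illustration only.
MICRO_OPS: Dict[str, List[str]] = {
    "add reg, reg": ["alu_add"],
    "add [mem], reg": ["load_tmp", "alu_add_tmp", "store_tmp"],
    "push reg": ["sub_sp", "store_reg"],
}


def decode(instruction: str) -> List[str]:
    """Expand one architectural instruction into its short micro-op program."""
    return list(MICRO_OPS[instruction])


def decode_stream(program: List[str]) -> List[str]:
    """Flatten a program into the micro-op stream the core would actually run."""
    uops: List[str] = []
    for instruction in program:
        uops.extend(decode(instruction))
    return uops


program = ["add reg, reg", "add [mem], reg", "push reg"]
print(decode_stream(program))
```

Three instructions turn into six micro-ops here, which is one way to see why a single per-instruction cycle count stops being meaningful.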
Make no mistake: the x86 instruction set architecture we all love to hate today has been a shambling undead zombie for decades now. RISC definitely won, which is why every x86-compatible processor has been built on top of RISC cores since the early 00s, if not earlier. Intel just doesn't want everyone to know it because the ISA is such a cash cow these days. Kind of like how the USA is really a nation whose official measurement system is the SI system, but we continue to use imperial units because we have official definitions that map one to the other.
Oh, but don't think that RISC is immune from this either. It makes my blood boil when people say, "RISC-V|ARM|MIPS|POWER is immune."
No, it's not. Neither is MIPS, neither is ARM, neither is POWER. If your processor has any form of speculative execution and depends on caches for maintaining instruction throughputs, which is to say literally all architectures on the planet since the Pentium-Pro demonstrated its performance advantages over the PowerPC 601, you will be susceptible to SPECTRE. Full stop. That's laws of physics talking, not Intel or IBM.
Whether it's implemented as a sea-of-gates in some off-brand ASIC or if it's an FPGA, or you're using the latest nanometer-scale process node by the most expensive fab house on the planet, it won't matter -- SPECTRE is an artifact of the micro-architecture used by the processor. It has nothing whatsoever to do with the ISA. It has everything to do with performance-at-all-costs, gotta-keep-them-pipes-full mentality that drives all of today's design requirements.
I will put the soapbox back in the closet now. Sorry.
@djsundog @requiem @thegibson I distinctly remember when the first round of SPECTRE and Meltdown attacks came out and everyone and their grandmother were heralding the technical superiority of ARM cores because they didn't have a successful demonstration of these attacks.
It only took several months of effort to demo the first attack for the ARM.
Then, POWER became the patron saint of processing. And, as I recall, not long after, its fortified walls fell eventually as well.
You can absolutely get to the moon from here if you have enough bandaids. But, I'll argue that there are easier ways to do it than creating a big, gooey stack of padded rubber strips carefully balanced on each other.
It might be easier to just pack two computers in the box that communicate via a simple(!) bus, with everything else strictly separate (no shared memory, no shared storage, etc). Security critical code ends up on the smaller of the two units and any insecure code can request it to do things but never measure the details because the communication granularity is from request to reply. (Remember to put any power management into the secure side of things or the insecure side can glean information from there.)
Still needs some care (e.g. constant time implementations) but it's much easier to reason about.
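A rough sketch of the request/reply shape described above. `SecureUnit` and `Bus` are made-up names, and two Python objects in one process obviously share memory, caches, and power, so this shows only the interface, not the isolation. The HMAC operation is an arbitrary stand-in for "security critical code".

```python
import hashlib
import hmac


class SecureUnit:
    """Stands in for the small, separate computer. The key never leaves it."""

    def __init__(self, key: bytes) -> None:
        self._key = key

    def handle(self, request: bytes) -> bytes:
        # Only the finished reply goes back over the link.
        return hmac.new(self._key, request, hashlib.sha256).digest()


class Bus:
    """Stands in for the simple link: whole requests in, whole replies out."""

    def __init__(self, unit: SecureUnit) -> None:
        self._unit = unit

    def call(self, request: bytes) -> bytes:
        # Copy the request so the two sides share no buffer.
        return self._unit.handle(bytes(request))


# Hypothetical key and message, for illustration only.
bus = Bus(SecureUnit(b"demo-key"))
tag = bus.call(b"sign this")
print(tag.hex())
```

The insecure side gets one observable per operation, the reply (and how long it took to arrive), which is why the constant-time caveat still applies.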
The biggest concern would be that it won't take long before chip vendors end up putting them into the same package again "because we made it secure, pinky promise!" (which is how Arm TrustZone and the various Intel initiatives work - and fail - these days)
Which reminds me - aren't you building a computer that also features a simple communication channel? ;-)
@sjb @requiem @thegibson That works for some workloads. Consider GPUs for example. However, for other workloads not so much. A more general approach would be a fleet of small processors interacting with each other over communications links each with their own private memory. This cellular approach to computing is something that was envisioned back in the days of Smalltalk, but never fully realized. The GreenArrays GA144 chip is probably the next incarnation of the idea, but its application domain appears to be limited to deeply embedded applications.
I don't claim to know a general purpose solution to this extremely general purpose problem. However knowing the true reasons why it exists in the first place is critical in knowing how to mitigate it, at least for specific domains.
Yes _and no_ ... RISC as a microarchitecture thoroughly and definitively won. There's a real argument to be made that a CISC ISA with a RISC microarchitecture is the true performance winner. (Cf. x86, GPUs)
And of course, as you say, our RISCs aren't as RISCy as they could be these days, either.
@requiem DNS isn't resolving from here... or for either of the "down for everyone or just me" sites i tried
kc5tja@pop-os:~$ dig jasongullickson.com
^Ckc5tja@pop-os:~$ dig -n jasongullickson.com
; <<>> DiG 9.16.6-Ubuntu <<>> -n jasongullickson.com
;; global options: +cmd
;; connection timed out; no servers could be reached
SPECTRE depends on changing between user and kernel modes of operation. The idea is to exploit failed speculation into kernel space. Under these conditions, you're still running in user-space, but the caches now have privileged information in them. How much depends on which paths were speculated in the kernel, and flushing those cache lines in favor of new user-mode content takes time. Hence, the timing side-channel.
With a compiler for a VLIW architecture, this can't occur, because speculation never happens across a privilege boundary. The cache is always hot with the working set of the process currently running.
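The part of the argument about caches keeping state after failed speculation can be pictured with a toy model. This is a simulation with invented costs (`HIT_COST`, `MISS_COST`) and made-up names (`ToyCache`, `squashed_speculation`, `recover`); it does not touch real hardware and says nothing about privilege modes. It models only the one idea that a discarded speculative access still leaves a cache line behind, which an observer can then find by timing.

```python
from typing import Iterable, Set

HIT_COST = 1     # assumed cost of a cached access, arbitrary units
MISS_COST = 100  # assumed cost of an uncached access, arbitrary units


class ToyCache:
    def __init__(self) -> None:
        self.lines: Set[int] = set()

    def access(self, line: int) -> int:
        """Return the simulated cost of touching a line, caching it on a miss."""
        if line in self.lines:
            return HIT_COST
        self.lines.add(line)
        return MISS_COST

    def flush(self) -> None:
        self.lines.clear()


def squashed_speculation(cache: ToyCache, secret: int) -> None:
    """A mis-speculated access: the result is thrown away, the cache fill is not."""
    cache.access(secret)


def recover(cache: ToyCache, candidates: Iterable[int]) -> int:
    """Time every candidate line; the fast one was touched speculatively."""
    timings = {line: cache.access(line) for line in candidates}
    return min(timings, key=timings.get)


cache = ToyCache()
cache.flush()
squashed_speculation(cache, secret=42)
print(recover(cache, range(256)))  # 42
```

On real hardware the probe is noisy and needs many repetitions; the simulation strips that away to show why the leftover cache state is the leak.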
I read the paper about that latest spectre variant and it looks like their whole lfence-bypassing attack relies on a secret-dependent indirect branch after the bounds/permission check (and the lfence), and to my best understanding, if that indirect branch was a retpoline, the attack would no longer work.
Am I missing something? I can't believe they haven't thought of such a simple mitigation...