I recently built a 12th Gen PC, expecting that an upgrade to 13th Gen would soon be a cheap and significant upgrade path. Now there isn't going to be any way to know if a second-hand CPU is damaged in this way.
I bought a $500 13th gen CPU that destroyed itself, replaced it (and didn't keep the dead CPU) with a $500 14th gen CPU that destroyed itself, and spent another ~$500 on related hardware and dumping Intel stuff to go AMD to get a working system. I also spent a lot of time trying to resolve the problem. I'd bet that I'm not the person burned worst, because someone could very easily have replaced their motherboard or memory or power supply unit in the hopes of fixing the issue, as any of these could have looked like potential causes, and there'd be no way for anyone to prove to Intel that this was the cause even if Intel intended to reimburse for these.
I might get $500 back at most, if Intel reimburses for the 14th gen CPU; based on what they've been doing so far, I'd assume that at best they'd send out another Intel CPU (which I no longer have a use for, having gone AMD).
And I was mostly using this system for fun. While I was corrupting my root filesystem regularly at boot at the end, I ultimately didn't -- as far as I know -- suffer any serious data loss or expense from the data that the processor was corrupting. My system was mostly to be used for my own entertainment. I didn't miss deadlines or lose critical information.
As Steve Burke has pointed out in earlier episodes on this, there are people who have been impacted by those secondary costs, some of which might make my own costs look irrelevant.
He was talking to video game companies who were using affected processors as well as having customers who were affected; they had apparently banned some customers for cheating because they knew that the internal state of the game was incorrect; they couldn't figure out what the customers were doing, but knew that their game state was being modified. It apparently wasn't the customers cheating, but their CPU, which had partially destroyed itself, and was now corrupting memory.
Another had been using CPUs for video game servers and those kept dying and taking down service; another company estimated that they'd lost $100k in player business due to the problem.
Apparently these were also popular, due to high single-threaded performance, with hedge funds that do stock trading. I imagine that a system that suddenly stops working or corrupts data can very quickly become extremely expensive in that context, far in excess of what the CPUs cost.
OEMs who built and sold systems containing these CPUs had apparently been taking back systems and repeatedly replacing parts; they probably incurred substantial costs and hits to their own reputation, as customers are upset with them.
Same thing with datacenter providers, who incurred a lot of costs investigating and mitigating problems, swapping parts and CPUs. One of these Burke quoted as having advised customers to use an alternate AMD-based system and if they insisted on the Intel one, the provider would charge a $1000 additional service fee to cover all the costs the provider was taking in having to deal with systems based on the CPUs. Gives an idea of what they were losing.
God only knows what the impact of having a ton of data around the world corrupted is. Probably no more than a tiny fraction of the problems related to corruption will ever actually be attributed to the CPUs themselves.
And I don't know how many systems out there may not be fully-tracked -- so they don't get updates to avoid the problem -- and have the CPUs built into them. Industrial automation hardware? Ship navigation systems? Who knows? All kinds of things that might fail in absolutely spectacular ways if they work for a period of time, then down the road, eventually start corrupting data more and more severely.
I mean, Intel might, at best, provide a cash refund for a dead CPU. But they aren't gonna cover losses from secondary problems, and there's no realistic way that most businesses and people who bought these could prove them, anyway.
Buying the last CPU they made before this clusterfuck occurred is maybe one of the best things you could have done and still be indirectly affected, as you got a reasonably fast system that wasn't directly affected -- if I'd known about this in advance, rather than Intel not saying anything, I'd have happily purchased a 12th gen CPU rather than buying another $1k in useless hardware and spending a ton of time trying to resolve my problems. You'll have the option to, at upgrade time, go AMD or 15th gen Intel and LGA 1851, if you want to hope that Intel's 15th gen is more solid than their previous two. Just means a new motherboard and, if you're using DDR4 memory, you'll need to toss that and buy DDR5.
If your CPU is crashing/unstable then yes, damage is already done, but for the few of us who bought these later just update your bios to the latest one, set intel defaults, do not overclock (I have even undervolted it a bit, but ymmv) and wait for the microcode update.
Though I do wonder if Intel isn't just stalling for time, I do hope they are not. Didn't wanna touch my build for next ~5 years.
That is, disappointingly, not sufficient to guarantee avoiding damage. I set all that in the BIOS using my first processor (13900KF) before ever inserting my replacement processor (14900KF) into the motherboard. The replacement processor still destroyed itself.
Processor 1 used only motherboard defaults and managed to destroy itself.
Processor 2 used only Intel recommended settings, no XMP memory profile, no Intel turbo boost, more conservative than motherboard defaults, and also destroyed itself.
I did not try running a processor for its lifetime at minimum memory speed or with only 1 core active. It's possible that that might be sufficient to avoid damage. If I hadn't already gone AMD over this, and had to use a processor from the affected generations, that's what I'd be doing now until Intel comes out with their update. Not gonna do much by way of fancy gaming, but at least the system's usable and hopefully won't destroy itself.
Real shit, can you sue them for this? I mean they aren't stopping selling them even knowing they are faulty. Seriously, how can you get your money back from these vultures.
Vote with your wallet and don't ever get anything from this piece of shit trash ass company again. What a joke. They aren't even stopping selling them KNOWING there's an issue. Wish I had money to sue the fuck out of them.
What, if anything, can customers do to slow or stop degradation ahead of the microcode update?
Intel recommends that users adhere to Intel Default Settings on their desktop processors, along with ensuring their BIOS is up to date. Once the microcode patch is released to Intel partners, we advise users check for the relevant BIOS updates.
I destroyed my second CPU, a 14900KF, while having already been aware of that recommendation, and having disabled all of the settings like that that the motherboard vendor had enabled by default prior to ever inserting the replacement CPU, and only used the CPU with those settings; it still destroyed itself, like the first. I am very confident that you can still destroy a CPU having done that.
That isn't to say that using conservative settings is a bad idea (and maybe doing something further, like running memory at minimum frequency, not just using the Intel recommended default rather than the motherboard vendor defaults, might actually manage to reliably avoid CPU damage). But I am confident that just running standard Intel recommended settings is not, alone, enough to avoid damage.
There's no 100% way until the new microcode is released next month. All affected CPUs are at risk of silicon degradation by the excessive voltage.
There are some power limits and July BIOS updates you can use that Intel says can help reduce the damage or prevent it entirely in some scenarios. I believe the damage is specifically caused by single threaded spikes, so reducing LLC and running something like prime95 in the background might hold the voltage low enough that it won't happen. But there is no fix yet, so if your CPU is susceptible, running it will degrade the CPU, at least until the fix is out.
If you can avoid using a new one, I would. I would not buy or use an unused 13th gen or 14th gen Intel CPU until Intel completes their updates.
In my case, there was a period of time where I had an old, damaged 13th gen CPU, and a new, unused 14th gen.
I was always able to use my damaged CPUs without problems as long as I booted up Linux and told it to use only one core (maxcpus=1 on the GRUB command line passed to the kernel). Even two cores enabled, and it couldn't even boot towards the end, but I never saw corruption with one.
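If you want to confirm that the single-core limit actually took effect after booting, something like this rough Python sketch could do it. The helper names parse_maxcpus and read_boot_cmdline are just my own, not from any tool; the only things it relies on are /proc/cmdline (where Linux exposes the command line the running kernel was booted with) and os.cpu_count().

```python
import os
from typing import Optional


def parse_maxcpus(cmdline: str) -> Optional[int]:
    """Return N from a maxcpus=N token on a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("maxcpus="):
            try:
                return int(token.split("=", 1)[1])
            except ValueError:
                # Malformed value, treat it as "no usable limit found".
                return None
    return None


def read_boot_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux box you'd call parse_maxcpus(read_boot_cmdline()) and compare it with os.cpu_count(), which should report the logical CPUs currently online. If the first says 1 and the second says 24, the parameter didn't do what you wanted.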
If I could rewind time, I would continue to use my old CPU and avoid using the new one. I would add maxcpus=1 to my Linux command line (to do it every boot, edit /etc/default/grub, then run sudo update-grub on Debian-family systems). And I'd use the damaged CPU on a single core until I know that Intel has a workaround in microcode and my motherboard has the relevant BIOS update applied, and then I'd swap in the replacement CPU.
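For anyone who'd rather script the /etc/default/grub edit than do it by hand, here's a minimal Python sketch of the text change. It assumes the stock Debian-style layout where the parameters live in a double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." line; that variable name is my assumption about your file, and add_kernel_param is a made-up helper name. It only transforms text, so writing the result back and running sudo update-grub are still up to you.

```python
import re


def add_kernel_param(grub_text: str, param: str = "maxcpus=1") -> str:
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless it is already present.

    Only handles the double-quoted form; if the variable is missing or
    quoted differently, the text is returned unchanged.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def patch(match: re.Match) -> str:
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    # count=1: touch only the first matching line.
    return pattern.sub(patch, grub_text, count=1)
```

Keep a backup copy of the original file before writing anything back, and look at the resulting line yourself before rebooting.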
If I didn't have a known-damaged CPU, just a still-working 13th or 14th gen processor, and could get by using an old desktop or laptop or something until the update is out, I'd probably do that if at all possible, so that I don't incur damage.
13th and 14th gen are literally the exact same hardware as 12th gen, but with boosted clock speeds and power requirements. Basically, Intel is struggling to develop new hardware, as they’re beginning to be limited by things like atom size and the speed of light across the width of the chip. So instead of developing new hardware, they just slapped new code onto the 12th gen chips and called them a new generation.
But they made the rookie mistake of not adequately dealing with heat dissipation (which is an easy mistake to make when overclocking), and chips are burning out.
I don't think that the voltage issue is simply heat, not unless it is some kind of extremely-localized or extremely-short-in-time issue internal to the chip. I hit the problem with a very hefty water cooler that didn't let the attached processor ever get very warm, at least as the processor reported temperatures.
Wendell, at Level1Techs, who did an earlier video with Steve Burke talking about this, looked over a dataset of hundreds of machines. They were running with conservative speed settings, in a datacenter where all temperatures were being logged, and he said that the hottest he ever saw on any hotspot on any processor in his dataset was, IIRC, 85 degrees Celsius, and normally they were well below that. He saw about a 50% failure rate.
If the CPU simply getting hot were the problem, then given that we hit it on our well-cooled CPUs, I'd have expected people running them in hotter environments to have slammed into the thing immediately. Ditto for Intel -- I'd guess (I'd hope) that part of their QA cycle involves running the processors in an industrial oven, as a way to simulate more-serious conditions. Those things are supposed to be fine at 100 degrees Celsius, at which point they throttle themselves.
So glad I spent like $2K on a computer with one of these in it that has custom firmware and BIOS on it. Guess I'm just fucked eh? Never buying Intel ever again.
If I had a known unused one, I would absolutely not use it until Intel finishes putting out their patch to motherboards to address this. You have no idea whether you could cause damage that won't be detected, leaving you with a slightly damaged processor that malfunctions occasionally.
Intel may publish guidance on how to use unpatched processors. If they don't -- they sure have not been forthcoming with information thus far -- here's my own suggestion.
When I do use it, I would, prior to booting any OS on the CPU, go into the BIOS and turn everything related to the CPU to minimal performance. Memory speed down, disable Intel turbo boost, everything. If you can disable cores there, disable all but one -- even my severely-damaged pair of CPUs could still boot without corrupting my root filesystem as long as I ran using only a single core (though two cores induced problems), and I'd take that as an argument in favor of one core being preferable, though I cannot say for sure that doing so helps avoid damaging the chip rather than just avoiding being affected by the damage once incurred.
And the first thing I'd do, booted into that minimal-performance-CPU-environment, would be to do that motherboard BIOS update. Then go back and reset the motherboard to defaults and use the thing normally.
Maybe that's over-cautious, but we know that the processors destroy themselves with use, and we have no idea what the minimum amount of time -- if any -- to incur damage is. Unless Intel can come out with some kind of diagnostic to reliably detect damaged CPUs, you won't know if you damaged your CPU in that window before the BIOS update, and it is maybe occasionally corrupting data, which I'd guess is a situation that you probably don't want to be in during the lifetime of the CPU.
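In the absence of a real diagnostic, about the best I can think of is a crude self-check along these lines. To be clear, this is my own guess at something that might catch gross corruption, not anything from Intel, and the function name is made up. It compresses and decompresses random data and compares hashes. A failure is a bad sign; a clean run proves nothing about whether the CPU is undamaged.

```python
import hashlib
import os
import zlib


def count_roundtrip_failures(rounds: int = 100, size: int = 1 << 20) -> int:
    """Compress and decompress random data; count rounds that fail to round-trip."""
    failures = 0
    for _ in range(rounds):
        data = os.urandom(size)
        expected = hashlib.sha256(data).digest()
        try:
            restored = zlib.decompress(zlib.compress(data))
        except zlib.error:
            # The compressed stream came back invalid.
            failures += 1
            continue
        if hashlib.sha256(restored).digest() != expected:
            # Decompression "succeeded" but the bytes differ.
            failures += 1
    return failures
```

This runs on a single core as written, so it says nothing about the multi-core case where I actually saw problems. Treat any nonzero result as a reason to stop trusting the machine with data you care about.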
I thought I read that Intel said this was from messing with voltages? I have had plenty of these processors in the last couple of years and never experienced crashes, but I don’t overclock
That was one initial theory, but it's known to not be the cause. An earlier video that Steve Burke and Wendell from Level1Techs did had Wendell examine several hundred CPUs that were running in servers on non-Z790 motherboards (another source of potential problems that was initially blamed) at conservative settings, with temperatures known and logged for the lifetime of the server (so not temperature). He still saw about a 50% failure rate.
I also personally destroyed one of my CPUs with motherboard default settings, and the other with Intel's recommended settings (less aggressive than the motherboard defaults), so I can personally attest to this not just being people running with crazy voltages or something.
There may also be other issues that people have caused by doing something else, but the elephant in the room has been narrowed down to processors destroying themselves while running well within spec.