- cross-posted to:
- technology@lemmy.world
Archived version: https://archive.ph/eSuy1
A few months ago, an engineer in a data center in Norway encountered some perplexing errors that caused a Windows server to suddenly reset its system clock to 55 days in the future. The engineer relied on the server to maintain a routing table that tracked cell phone numbers in real time as they moved from one carrier to the other. A jump of eight weeks had dire consequences because it caused numbers that had yet to be transferred to be listed as having already been moved and numbers that had already been transferred to be reported as pending.
“With these updated routing tables, a lot of people were unable to make calls, as we didn’t have a correct state!” the engineer, who asked to be identified only by his first name, Simen, wrote in an email. “We would route incoming and outgoing calls to the wrong operators! This meant, e.g., children could not reach their parents and vice versa.”
A show-stopping issue
Simen had experienced a similar error last August when a machine running Windows Server 2019 reset its clock to January 2023 and then changed it back a short time later. Troubleshooting the cause of that mysterious reset was hampered because the engineers didn’t discover it until after event logs had been purged. The newer jump of 55 days, on a machine running Windows Server 2016, prompted him to once again search for a cause, and this time, he found it.
The culprit was a little-known feature in Windows known as Secure Time Seeding. Microsoft introduced the time-keeping feature in 2016 as a way to ensure that system clocks were accurate. Windows systems with clocks set to the wrong time can cause disastrous errors when they can’t properly parse timestamps in digital certificates or they execute jobs too early, too late, or out of the prescribed order. Secure Time Seeding, Microsoft said, was a hedge against failures in the battery-powered onboard devices designed to keep accurate time even when the machine is powered down.
“You may ask—why doesn’t the device ask the nearest time server for the current time over the network?” Microsoft engineers wrote. “Since the device is not in a state to communicate securely over the network, it cannot obtain time securely over the network as well, unless you choose to ignore network security or at least punch some holes into it by making exceptions.”
To avoid making security exceptions, Secure Time Seeding sets the time based on data inside an SSL handshake the machine makes with remote servers. These handshakes occur whenever two devices connect using the Secure Sockets Layer protocol, the mechanism that provides encrypted HTTPS sessions (it is also known as Transport Layer Security). Because Secure Time Seeding (abbreviated as STS for the rest of this article) used SSL certificates Windows already stored locally, it could ensure that the machine was securely connected to the remote server. The mechanism, Microsoft engineers wrote, “helped us to break the cyclical dependency between client system time and security keys, including SSL certificates.”
Simen wasn’t the only person encountering wild and spontaneous fluctuations in Windows system clocks used in mission-critical environments. Sometime last year, a separate engineer named Ken began seeing similar time drifts. They were limited to two or three servers and occurred every few months. Sometimes, the clock times jumped by a matter of weeks. Other times, the times changed to as late as the year 2159.
“It has exponentially grown to be more and more servers that are affected by this,” Ken wrote in an email. “In total, we have around 20 servers (VMs) that have experienced this, out of 5,000. So it’s not a huge amount, but it is considerable, especially considering the damage this does. It usually happens to database servers. When a database server jumps in time, it wreaks havoc, and the backup won’t run, either, as long as the server has such a huge offset in time. For our customers, this is crucial.”
Simen and Ken, who both asked to be identified only by their first names because they weren’t authorized by their employers to speak on the record, soon found that engineers and administrators had been reporting the same time resets since 2016.
In 2017, for instance, a Reddit user in a sysadmin forum reported that some Windows 10 machines the user administered for a university were reporting inaccurate times, in some cases by as many as 31 hours in the past. The Reddit user eventually discovered that the time changes were correlated to a Windows registry key in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\SecureTimeLimits. Additional investigation showed that the time changes were also linked to errors that reported valid SSL certificates used by the university website were invalid when some people tried to access it. The admin reached the following conclusion:
TLDR: Windows 10 has a feature called Secure Time which is on by default. It correlates time stamp metadata from SSL packets and matches them against time from the DCs. It processes these various times by means of black magic and sets the system clock accordingly. This feature has the potential to flip out and set the system time to a random time in the past. The flip out MIGHT be caused by issues with SSL traffic.
Other examples of people reporting the same behavior—for example, here and here—date back to 2016, shortly after the rollout of STS. More recent reports of harmful STS-induced time changes are here, here, and here.
“We’ve run into a show-stopping issue where time on a bunch of production systems jumped forward 17 hours,” one Reddit user wrote. “If you’ve been in the game more than a week, you know the havoc this can cause.”
STS primer
To determine the current time, STS pulls a set of metadata contained in the SSL handshake. Specifically, the data is:
- ServerUnixTime, a date and time representation showing the number of seconds that have elapsed since 00:00:00 UTC on January 1, 1970
- Cryptographically signed data obtained from the remote server’s SSL certificate showing whether it has been revoked under a mechanism known as the Online Certificate Status Protocol.
Microsoft engineers said they used the ServerUnixTime data “assuming it is somewhat accurate” but went on to acknowledge in the same sentence that it “can also be incorrect.” To prevent STS from resetting system clocks based on data provided by a single out-of-sync remote server, STS makes randomly interspersed SSL connections to multiple servers to arrive at a reliable range for the current time. The mechanism then merges the ServerUnixTime with the OCSP validity period to produce the smallest possible time range and assigns it a confidence score. When the score reaches a sufficiently high threshold, Windows classifies the data as an STSHC, short for Secure Time Seed of High Confidence. The STSHC is then used to monitor system clocks for “gross errors” and correct them.
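The range-merging step described above can be pictured with a small sketch. This is not Microsoft's implementation, which the article does not disclose; the names `TimeRange`, `estimate_range`, and `merge_samples`, and the `tolerance_s` parameter, are hypothetical and exist only to illustrate the idea of intersecting a claimed server time with an OCSP validity window.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TimeRange:
    """Closed interval of candidate times, in seconds since the Unix epoch."""
    earliest: int
    latest: int


def intersect(a: TimeRange, b: TimeRange) -> Optional[TimeRange]:
    """Return the overlap of two ranges, or None if they do not overlap."""
    lo = max(a.earliest, b.earliest)
    hi = min(a.latest, b.latest)
    return TimeRange(lo, hi) if lo <= hi else None


def estimate_range(server_unix_time: int, tolerance_s: int,
                   ocsp_this_update: int, ocsp_next_update: int) -> Optional[TimeRange]:
    """Narrow one handshake's claimed time using the OCSP validity window.

    tolerance_s is an assumed slack around ServerUnixTime (hypothetical).
    """
    claimed = TimeRange(server_unix_time - tolerance_s, server_unix_time + tolerance_s)
    ocsp_window = TimeRange(ocsp_this_update, ocsp_next_update)
    return intersect(claimed, ocsp_window)


def merge_samples(ranges: Iterable[Optional[TimeRange]]) -> Optional[TimeRange]:
    """Intersect the ranges from several servers; None means they disagree."""
    merged: Optional[TimeRange] = None
    for r in ranges:
        if r is None:
            continue  # this sample was inconsistent on its own, so skip it
        merged = r if merged is None else intersect(merged, r)
        if merged is None:
            return None
    return merged
```

In this toy model, a handshake whose ServerUnixTime falls outside the OCSP window is simply discarded. The real feature evidently behaves differently in some cases, which is the open question of the article.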
Despite the checks and balances built into STS to ensure it provides accurate time estimates, the time jumps indicate the feature sometimes makes wild guesses that are off by days, weeks, months, or even years.
“At this point, we are not completely sure why secure time seeding is doing this,” Ken wrote in an email. “Being so seemingly random, it’s difficult to [understand]. Microsoft hasn’t really been helpful in trying to track this, either. I’ve sent over logs and information, but they haven’t really followed this up. They seem more interested in closing the case.”
The logs Ken sent looked like the ones shown in the two screenshots below. They captured the system events that occurred immediately before and after the STS changed the times. The selected line in the first image shows the bounds of what STS calculates as the correct time based on data from SSL handshakes and the heuristics used to corroborate it.
The “Projected Secure Time” entry immediately above the selected line shows that Windows estimates the current date to be October 20, 2023, more than four months later than the time shown in the system clock. STS then changes the system clock to match the incorrectly projected secure time, as shown in the “Target system time.”
The second image shows a similar scenario in which STS changes the date from June 10, 2023, to July 5, 2023.
Simen, meanwhile, said he has also reported the time resets to multiple groups at Microsoft. When reporting the problems on Microsoft’s feedback hub in May, he said, he received no company response. He then reported it through the Microsoft Security Response Center in June. The submission was closed as a “non-MSRC case” with no elaboration.
The engineer then tapped a third party specializing in Microsoft cloud security to act as an intermediary. The intermediary relayed a response from Microsoft recommending STS be turned off when the server receives reliable timekeeping through the Network Time Protocol.
“Unfortunately, this recommendation isn’t publicly available, and it is still far from enough to stop the wrongly designed feature to keep wreaking havoc around the world,” Simen wrote in an email.
Warning: STS will “bite you in the butt”
Simen said he believes the STS design is based on a fundamental misinterpretation of the TLS specification. Microsoft’s description of STS acknowledges that some SSL implementations don’t put the current system time of the server in the ServerUnixTime field at all. Instead, these implementations—most notably the widely used OpenSSL code library starting in 2014—populate the field with random values. Microsoft’s description goes on to say, “We have observed that most servers provide a fairly accurate value in this field and the rest provide random values.”
“The false assumption is that most SSL implementations return the server time,” Simen said. “This was probably true in a Microsoft-only ecosystem back when they implemented it, but at that time [when STS was introduced], OpenSSL was already sending random data instead.”
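To see why a random value in this field is dangerous when read as a clock, here is a minimal sketch that interprets a 32-bit number as seconds since the Unix epoch. The helper names are hypothetical, and the one-day window is an arbitrary assumption, not a documented STS parameter.

```python
import secrets
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
UINT32_MAX = 2**32 - 1


def as_utc(server_unix_time: int) -> datetime:
    """Interpret a 32-bit field as seconds since the Unix epoch."""
    return EPOCH + timedelta(seconds=server_unix_time)


def chance_random_value_looks_plausible(window_s: int) -> float:
    """Fraction of all 32-bit values within +/- window_s of any given time.

    Ignores clipping at the ends of the 32-bit range.
    """
    return (2 * window_s + 1) / 2**32


if __name__ == "__main__":
    # A server that fills the field with random bytes can "report" any
    # moment between 1970 and the 32-bit rollover in 2106.
    random_field = secrets.randbits(32)
    print("random field read as a date:", as_utc(random_field))
    print("latest representable date:  ", as_utc(UINT32_MAX))
    print("chance of landing within a day of the true time:",
          chance_random_value_looks_plausible(86_400))
```

A single random value almost never lands near the true time, so it is easy to reject on its own. The trouble described in the article only arises if such values somehow pass whatever corroboration the heuristic applies.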
While official Microsoft talking points play down the unreliability of STS, Ryan Ries, whose LinkedIn profile indicates he is a senior Windows escalation engineer at Microsoft, wasn’t as reticent when discussing STS on social media last year.
“Hey people,” he wrote. “If you manage Active Directory domain controllers, I want to give you some UNOFFICIAL advice that is solely my personal opinion: Disable Secure Time Seeding for w32time on your DCs.” When someone asked him why, Ries responded, “Because it’s just a matter of time—wink—before it bites you in the butt.”
A Microsoft representative emailed the following statement several hours after this post went live on Ars:
Secure Time Seeding feature is a heuristic-based method of time keeping that also helps correct system time in case of certain software/firmware/hardware timekeeping failures. The feature has been enabled by default in all default Windows configurations and has been shown to function as intended in default configurations.
Time distribution is unique to each deployment and customers often configure their machines to their particular needs. Given the heuristic nature of Secure Time Seeding and the variety of possible deployments used by our customers, we have provided the ability to disable this feature if it does not suit their needs. Our understanding is that there are likely unique, proprietary, complex factors in deployments where customers are experiencing Secure Time Seeding issues and these customers do not benefit from this feature as it is currently implemented. In these isolated cases, the only course of action we can recommend is to disable this feature in their deployments.
We agree that the overall direction of technology with the adaption of TLS v1.3 and other developments in this area could make Secure Time Seeding decreasingly effective over time, but we are not aware of any bugs arising from their use. This technology direction also makes heuristic calculation of time using SSL/TLS far less attractive when compared to deterministic, secure time synchronization.
We continue to investigate how to best secure time synchronization on the Internet and welcome customer input on how to best meet their future needs.
The mystery continues
As Simen noted earlier, it’s not clear precisely what causes STS to make the errors sometimes but not always.
“This is what really strikes me as odd,” Simen wrote. Microsoft “know the field they look at might contain random data, so my guess is that their implementation breaks down when this is skewed so that most/all implementations they communicate with contains random data rather than just some.”
HD Moore, CTO and co-founder at runZero, speculated that the cause is some sort of logic bug in Microsoft code. On Signal, he wrote:
If OpenSSL has been setting random unix times in TLS responses for a long period of time, but this bug is showing up infrequently, then it’s likely harder to trigger than just forcing a bunch of outbound TLS connections to a server with bogus timestamp replies—if it was that easy, it would happen far more frequently.
Either the STS logic requires different root certificates as the signer, or some variety in the hostnames/IPs, or only triggers on certain flavors of random timestamp (like values dividable by 1024 or something).
It smells like a logic bug that is triggered infrequently by fully random timestamps (32-bit) and likely just some subset of values and with some other conditions (like multiple requests in some period of time to multiple certs, etc.).
There are other means to ensure server clocks remain accurate, Moore said:
[Clock-setting] seems like something better handled through NTP, or at least through a trusted TLS connection to a known endpoint operated by the vendor (time.windows.com and friends). The super lazy (but arguably safer) way to get a trusted timestamp is something like:
❯ curl -s -vvv https://www.microsoft.com/4040 2>&1 | grep -i '< date:'
< date: Wed, 16 Aug 2023 04:37:31 GMT
Second-ish precision, and if you lock the HTTP client to a short list of trusted CA roots for the target domain, pretty hard to mess with. I used something similar forever ago on Linux systems where the clock would go wrong often—set the hwclock to the HTTP response timestamp of a known good server, then run NTP, which would succeed since the clock was close enough to be within the boundary check—otherwise NTP would fail since the clock was too far off.
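Moore's approach of reading the HTTP `Date` header from a known server could look roughly like this in Python. This is a sketch under assumptions: the function names are hypothetical, the optional `cafile` argument stands in for "a short list of trusted CA roots", and nothing here sets the clock; it only reports the offset so that a separate step (such as NTP) can act on it.

```python
import ssl
import urllib.error
import urllib.request
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_http_date(header_value: str) -> datetime:
    """Parse an HTTP Date header such as 'Wed, 16 Aug 2023 04:37:31 GMT'."""
    return parsedate_to_datetime(header_value)


def fetch_server_date(url: str, cafile: Optional[str] = None,
                      timeout_s: float = 5.0) -> datetime:
    """Return the Date header of an HTTPS response.

    cafile, if given, restricts trust to the CA certificates in that file;
    otherwise the system trust store is used.
    """
    context = ssl.create_default_context(cafile=cafile)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout_s, context=context) as resp:
            return parse_http_date(resp.headers["Date"])
    except urllib.error.HTTPError as err:
        # Error responses (for example a 404) still carry a Date header.
        return parse_http_date(err.headers["Date"])


def offset_seconds(server_time: datetime, local_time: datetime) -> float:
    """Positive result means the local clock is behind the server."""
    return (server_time - local_time).total_seconds()
```

The precision is about one second, as Moore notes, which is enough to bring a badly wrong clock close enough for NTP to take over.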
As the creator and lead developer of the Metasploit exploit framework, a penetration tester, and a chief security officer, Moore has a deep background in security. He speculated that it might be possible for malicious actors to exploit STS to breach Windows systems that don’t have STS turned off. One possible exploit would work with an attack technique known as Server Side Request Forgery.
Microsoft’s repeated refusal to engage with customers experiencing these problems means that for the foreseeable future, Windows will by default continue to automatically reset system clocks based on values that remote third parties include in SSL handshakes. Further, it means that it will be incumbent on individual admins to manually turn off STS when it causes problems.
That, in turn, is likely to keep fueling criticism that the feature as it has existed for the past seven years does more harm than good.
STS “is more like malware than an actual feature,” Simen wrote. “I’m amazed that the developers didn’t see it, that QA didn’t see it, and that they even wrote about it publicly without anyone raising a red flag. And that nobody at Microsoft has acted when being made aware of it.”
That’s a hideous, over-engineered attempt at solving something that has already been solved better in open standards and FOSS. I’d argue that bare NTP is more secure on merit of not trusting random TLS certs to be accurate.
EDIT: Found my comment to be too negative. I will stand by the fact that it comes across as hackish but can understand the logic in how it is supposed to work in theory, so, not stupid, just fundamentally insecure and terrible by merit of its design not paying enough attention to context.
Or as they say, every probabilistic curve ends somewhere.
If it works 999,999 times out of a million, then every millionth Windows machine will break.
What an awful way to try to figure out the time. I mean it could at least pop a big error if, lol, the time seems off by a week!
no no, clearly we’re in 2159
Tss +/- a hundred years or two? The planet is over 4 billion years old!
In that context it’s basically accurate and the error is at rounding level. Good way to stay positive!
You are right; life is short, no time for un-needed pessimism!
Cheers to you!
NTP is touched on in the article, and a quick Google shows that the largest difference NTP can correct before exiting in a panic is 1000s.
However there is an argument/flag to run ntpd once in a “just fix it” mode. So, having to use cert timestamps to “rough” the clock and allow NTP to “fine” it isn’t necessary.
It does seem strange to essentially create an out-of-band/off-label/out-of-scope time management system, when there are already open standards that work well for it.
Agreed. I think that the problem that I have with this is similar to the problem of orgs in the US using an SSN as a form of universal ID. The Social Security Administration clearly states that that is not the purpose and will not provide verification because of this. X.509 certs are not meant for this purpose and their implementation does not take this use case into consideration as would be required for its use in a secure manner.
feature in 2016 as a way to ensure that system clocks were accurate
Oh, boy…
Stupid of them to use a Windows server in the first place…
If they’re going to use heuristics like that to get an approximate time, they could at least use it to validate connections to NTP servers or something that can actually sync the time properly. Get approximate time for initial sync, then contact a Microsoft server to get a more accurate time over HTTPS (which is what this is supposedly meant to address), then use NTP to get accurate time and validate that it’s close enough, and only then, when everything checks out, set the system clock to that time.
Sounds like the heuristic takes multiple samples and only uses them if they are within some consistency threshold, to hedge against the cases where the field has random data.
The reason it only fails rarely and randomly is because it only happens when multiple actually random timestamps happen to line up around the same time.
Sort of like how several applications (cough git cough) have failure modes when two different files happen to have the same hash.
Turns out developers are bad at statistics and probabilities and don’t understand the birthday paradox.
Hmm, the birthday problem alludes to what’s going on, except the birthday problem discards the year and the time.
If it’s 2x 32bit random timestamps that have to align within a 10 minute window (600 seconds) it’s a probability of 600 in 4.3 billion (uint32 max).
Still vanishingly small.
However, if a server makes 10 requests as part of STS, and you have 5000 servers, then there is a significantly higher chance of having to deal with the fallout.
That is, of course, assuming all server clocks slip enough to trigger this, and that all STS timestamps are random 32bit.
And there might be something in the way that 32bit timestamp is randomised. As it’s part of a cryptography system, it would make sense to be cryptographically secure. But seeing as it’s not directly part of the cryptographic process, it could be a cheaper/faster RNG.
The server clocks don’t actually have to slip at all to trigger this. They just have to not match up with whatever the STS comes up with as the time.
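The back-of-the-envelope estimate in this thread (a 600-second window, uniformly random 32-bit values, 10 requests per server, 5000 servers) can be worked through in a few lines. Every number is the commenters' assumption rather than a known STS parameter, and the model (any two samples on the same server agreeing by chance) is a simplification.

```python
import math


def pair_probability(window_s: int, space: int = 2**32) -> float:
    """Chance that two uniformly random values fall within the window,
    using the thread's simplification of window / space."""
    return window_s / space


def fleet_probability(window_s: int, samples_per_server: int, servers: int) -> float:
    """Chance that at least one pair of samples agrees somewhere in the fleet."""
    pairs_per_server = math.comb(samples_per_server, 2)
    total_pairs = pairs_per_server * servers
    return 1.0 - (1.0 - pair_probability(window_s)) ** total_pairs


if __name__ == "__main__":
    print("single pair:     ", pair_probability(600))
    print("one server (10): ", fleet_probability(600, 10, 1))
    print("fleet of 5000:   ", fleet_probability(600, 10, 5000))
```

Under these assumptions the fleet-wide figure comes out at roughly 3 percent per round of sampling, which is small for any single round but far from negligible when the sampling repeats over months.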