NitroWare.net

Our experience with Rangeley and Vendor statements on the problem

When I first learnt of this issue from the tech press, I was already aware that I had a device with the affected processor: a pre-production 20TB NAS Pro box from Seagate dating back to 2014. I was concerned that it might randomly brick itself one day, leaving me in a conundrum over how to recover my data from the device and/or get it repaired. At this stage, knowing only vaguely which brands were affected, I was not that interested in the issue overall beyond my concern for my own device.

At the Techleaders conference for Australian IT journalists in March 2017, I caught up with my local Seagate representative to discuss the issue, but he had not heard about it at all and asked me to send him some links on the matter. He did, however, understand the ramifications and suggested we could possibly swap my unit for another review unit of similar age if mine failed. That is not a foolproof fix, and it does not address the situation of any customers out there with retail Seagate/LaCie NAS devices.

That was the last I heard from Seagate on the Rangeley Clock Signal fault.

Not long after the original wave of media coverage, I noticed a mention of Dell in a news report and made the connection: Dell's line of datacentre open networking switches uses this processor, and given the life cycle of those products, they too were affected. Dell eventually published its own technical advisory on the matter.

This was my "oh $h!t" moment. As a system administrator, I have several of these high-end Dell enterprise switches in a network I manage, handling mission-critical traffic. Although they are set up for redundancy and high availability, a core switch failure still represents a major outage on the network one way or another, and as many admins know, falling back to a non-redundant system is risky. If any further outage happens in that state, everything is down: a classic caught-with-your-pants-down scenario.

Based on Dell’s advisory, which was published around February 2017, I contacted both my Dell Sales and Tech Support points of contact for advice on how to go about replacing these switches, as well as my PR contact to ask what Dell’s policy was for this issue beyond what was published in the advisory. At the time, the advisory only looked forward to a replacement program later in the year.

However, Dell, like the other vendors involved, was relatively quick to fix its devices in production, despite this family of switches having a long lead time. Switches made after March 2017 do not suffer from the fault; only units made from 2014 to 2017 are affected.

Given that I was personally exposed to two different brands with the fault, each with a different approach to customer remedies, I thought it would be prudent to reach out to other vendors we work with for media reviews to see how they were approaching the issue at an Australian or global level.

I contacted my Australian PR contacts for Cisco, Netgear and Dell and received statements on the matter.

In March 2017, Dell Australia sent us the following statement:

 

"Customer satisfaction and product quality are central to Dell’s business. A component manufactured by a supplier and included in limited Dell EMC networking products has a clock signal that can potentially degrade over time. No failures of these products have been reported to date. Beginning in July, Dell will proactively begin replacing impacted products that are under warranty or that are covered by a customer’s service contract.

 You can also access some FAQs regarding this on the support site here: http://www.dell.com/support/article/us/en/19/QNA44095/networking-clock-signal-qa?lang=EN"

There are several interesting items in this statement; in retrospect, some of these commitments were honoured and some were not. More on that later.

On the 3rd of March 2017, The Register published an executive statement from Netgear assigning blame and responsibility and defining a path for repair.

http://www.theregister.co.uk/2017/03/03/netgear_recalling_hardware_with_bad_intel_atoms/

 

"In a statement emailed to The Register, Richard Jonker, VP of SMB product line management for Netgear, said: 'Netgear is taking full responsibility for this component supplier concern and we can state at this time that we are issuing a full-scale recall for the affected products. We will be contacting all registered owners of these products and providing a swap where they will receive a new product and the affected product will be returned to Netgear.'"

 

In its formal Knowledge Base article on the issue, however, Netgear did not actually recall the affected devices as the executive statement announced. Instead, it offered repairs for registered users, i.e. users with devices in warranty, whereas a formal, legally binding recall is broader reaching.

Note that the statement ominously refers only to a ‘component supplier’. During the period this story broke, February to March 2017, Intel’s name was deliberately withheld from vendor communications on the issue. Intel did not formally disclose to the media or the public that its processors were faulty, apart from publishing an errata document listing the technical fixes to its products.

 

"NETGEAR will not be providing additional comment outside of the details outlined by the KB article below. 

https://kb.netgear.com/000037344/Service-Note-for-RN3130-RN3138-WC7500-and-WC7600v2?cid=wmt_netgear_organic"

Cisco was also identified in the initial reporting by The Register, which I also followed up in March 2017, especially as Australia has strict consumer protection laws at both federal and state level.

https://www.cisco.com/c/en/us/support/web/clock-signal.html#~overview

https://www.theregister.co.uk/2017/02/03/cisco_clock_component_may_fail/

All we received from Cisco ANZ was the following, equally vague statement, without even a link to further information. Again we see the ‘supplier of anonymity’, as Intel fixed the issue relatively silently and compensated its customers for repairs behind closed doors under a confidentiality umbrella.

 

"Cisco strives to deliver technologies and services that exceed customers’ expectations, and meet rigorous quality and customer experience standards. We became aware of an issue related to a clock signal component manufactured by one supplier. We have worked with the supplier to resolve the issue, and we’re providing information and support for our customers."

 

Since Intel did not issue a statement of its own on the matter in 2017, I did not follow up with them and instead focused on their OEM partners. The issue can actually be avoided altogether depending on how the OEM implemented Intel’s chip: the typical design is affected, but some more elaborate or specific designs are not.

Intel said little on the matter apart from its technical documentation. However, Intel's Robert Holmes Swan, CFO and executive vice president, stated at the time:

 

"But secondly, and a little bit more significant, we were observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints..."

 

As far as my media-centric investigation goes, I heard nothing further from the vendors I contacted for comment, either directly or via other IT news media. After the initial round of reporting in February and March 2017, the issue was ‘old news’.