Staying Strong—And Safeguarding Your Data—During Sandy

We continue to think about our customers, friends and families on the east coast as they fight through yet another storm. We’re built to serve you and are here to help in any way you need. You can also make a donation to the Red Cross here.

At RichRelevance, we built our infrastructure to be a fully redundant self-healing service—one that could withstand the toughest conditions and worst disasters. Very rarely does one encounter such tests, but Hurricane Sandy provided just such a catastrophic outage to prove our systems.

So what happened during Hurricane Sandy? In short, our New York data center went from 100% online to catastrophic flooding in 90 minutes; but in spite of this, our operations and delivery to our customers never skipped a beat. Here’s the rundown of events:

  • Utility power failed at 4:09 PM PST on Monday, October 29, but redundant power generators took over with no loss in service.
  • When the storm hit, our datacenter lost four internet peering services, but because of our diverse IP circuits, we continued to serve into the peak of the storm.
  • At 7:34 PM PST, we were notified that the sub-basement was taking in water and that it was not known whether this could be corrected. Four minutes later, the vendor called for evacuation. Water in the sub-basement had flooded the diesel tanks, and over the next hour the flooding rose 1.5 levels, to five feet above the first-floor lobby.
  • At 7:39 PM PST, we were given one hour to back up data and shut down all servers, removing our data center from production. We completed the backup in 20 minutes. In the end, the vendor was able to stay up for 12 hours on hand-delivered diesel, allowing a graceful shutdown for all customers. Kudos to our partner!
  • The vendor evacuated 100% of the building after refueling the generators one last time.
  • The data center was fully shut down on Tuesday, October 30 at 7:31 AM PST and remained down until 11:24 PM PST that day, when the storm had passed and the generators were restarted with hand-delivered fuel.
  • We remain on generator power (as of November 8) and expect to remain so for another week until Con Edison restores utility power and replaces transformers. The data center vendor will then fully test all equipment and replace any damaged systems prior to switching back to utility power sources.

Amidst the chaos wreaked by Sandy, {rr} continued serving recommendations in under 80 ms. When the data center went down, we gracefully failed over to Chicago and Virginia with no loss of service or data. Recommendations, promotions and ads continued to be served flawlessly at under 80 ms, with no visible impact on our merchants, brands or shoppers.

The technology and infrastructure that kept our services seamlessly performing through this catastrophic outage remain the backbone and foundation for our Black Friday/holiday infrastructure.

We are ready for Black Friday, from disaster recovery to scaling, because RichRelevance has a fully redundant self-healing data service. Each of our data centers runs as a standalone replica of the others. All of our front-end data centers (6 in diverse US locations, 2 in Europe) are geographically load balanced and fail over to a secondary and a tertiary data center.
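To make the secondary/tertiary failover idea concrete, here is a minimal sketch in Python. It is an illustration only, not a description of RichRelevance's actual routing: the region names, the preference lists and the `pick_datacenter` helper are all hypothetical, and real geographic load balancing usually happens at the DNS or network layer rather than in application code.

```python
from typing import Dict, List, Optional, Set

# Hypothetical preference order per region: primary, secondary, tertiary.
# The region and data center names are illustrative assumptions only.
FAILOVER_ORDER: Dict[str, List[str]] = {
    "us-northeast": ["new-york", "virginia", "chicago"],
    "us-west": ["seattle", "chicago", "virginia"],
}


def pick_datacenter(region: str, healthy: Set[str]) -> Optional[str]:
    """Return the first healthy data center in the region's preference order.

    Returns None if the region is unknown or no listed data center is healthy.
    """
    for datacenter in FAILOVER_ORDER.get(region, []):
        if datacenter in healthy:
            return datacenter
    return None


if __name__ == "__main__":
    # With New York offline, northeast traffic moves to the secondary site.
    print(pick_datacenter("us-northeast", {"virginia", "chicago", "seattle"}))
```

Because every site in this sketch is treated as a full replica, the lookup only needs a health signal per data center; no state has to be moved at failover time.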

In further preparation for Black Friday/Cyber Monday and the upcoming holiday season, we doubled down on our IT infrastructure:

  • We refreshed over 70% of our servers to the latest technology, expanding services in all front end and back end data centers.
  • We increased our backend processing speeds for feeds and models tenfold.
  • We put a temporary “peaker” data center in place in Seattle to serve West Coast shoppers. The term comes from the electrical industry, where “peaker” power plants run only when demand for electricity peaks. We do the same with data centers, bringing them up to guarantee our service during the peak holiday shopping season. We also have a flexible contract for additional “peakers” in Virginia and Amsterdam if needed; another data center can be brought online within 48 hours.
  • We upgraded a data center in the EU, creating more diversity in our services with all-new servers and a higher-performing vendor.
  • We’ve scaled to take in more than twice the traffic of last year.

Any one of our data centers can handle 100% of the world’s traffic into {rr} by itself, but to provide the industry’s fastest response time at 100% uptime, we over-engineer and build highly redundant services. In fact, we’re entering the holiday season with our fastest recommendations response time ever: a 65 ms average over the last 10 days.

So, while we at RichRelevance hope that our customers, friends and family remain strong and safe during these storms, rest assured that we will keep your services the same—strong and safe.

This post was written by Kevin Duffey

ABOUT Kevin Duffey
As Vice President of IT Operations, Kevin brings more than 30 years of engineering, operations and management experience from both the public and private sector to RichRelevance. He is responsible for meeting RichRelevance’s customer needs by aligning business requirements with IT operations, and managing complex budgets and vendors. Under Kevin’s leadership, the IT team maintains 100% uptime and performance of fourteen global data centers that deliver more than 1 billion recommendations each day on more than 500 retail websites. These results are due in part to continuous IT innovation using the latest technology, including multi-tenant Hadoop leveraging Hive, Parquet, Spark, HBase and Cassandra. Prior to RichRelevance, Kevin worked in the public, private and entrepreneurial sectors for diverse companies such as the General Services Administration, AARP, Sonopia and Marketo.