Disaster Recovery – WHY TESTING IS KEY
Organisations, big or small, must understand their backup systems and test thoroughly that the data can be retrieved and that the recovery process works. This article describes how the California DMV was crippled for nearly two weeks while its systems were rebuilt. No data was compromised and the system had been tested before, yet the agency still learnt lessons about its business continuity.
Disaster Recovery and the DMV: Testing is Key
In late October, a systems outage at the California Department of Motor Vehicles (DMV) lasted roughly two weeks, and stemmed from the “cascading failure” of three hard drives that overwhelmed the agency’s network — despite it being decentralized and using two independent hardware systems, a state official said.
Preceded by an unrelated “issue” on Oct. 19, the three hard drive failures on Oct. 24 sparked an event that persisted until Nov. 7, although most offices were restored days earlier.
No data was compromised or lost, but the outage affected thousands of customers and 122 of 188 DMV offices in the nation’s most populous state. For a time, more than half of DMV offices were unable to process driver’s license, ID or vehicle registration transactions.
Such an outage is “rare,” said Jessica Gonzalez, the DMV’s assistant deputy director of Public Affairs, adding that online services were unaffected during either incident. Still, the department is working with its state partners “to closely analyze this issue, and determine best steps moving forward to ensure an outage like this does not occur again,” Gonzalez told Government Technology via email.
The DMV’s system was built with disaster recovery and business continuity in mind, and utilizes two hardware systems at two separate locations. Its hard disks are industry standard and current-day technology. The department handles its own backup, and conducts two technology recovery tests per year, in coordination with the state Office of Technology Services.
Each of the DMV’s hardware systems has failover and redundancy built in by having primary and backup systems in the same hardware chassis. And overall, the system was built to withstand the loss of multiple hard disks in either the primary or the secondary system, Gonzalez said — but not both.
The failure was complicated, she said, because “the systems in each of the physical locations have, in the intervening time period, transitioned to production usage for both … breaking redundancy and weakening the disaster recovery process.”
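The redundancy logic Gonzalez describes can be illustrated with a small model. This is only a sketch under stated assumptions: the `Site` class, the `tolerated_failures` value of 2 and the `standby_is_idle` flag are hypothetical names and numbers chosen for illustration, not details of the DMV's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One hardware system at one physical location."""
    name: str
    failed_disks: int
    tolerated_failures: int  # hypothetical value, not the DMV's real figure

    def healthy(self) -> bool:
        # A site keeps working while its disk failures stay within tolerance.
        return self.failed_disks <= self.tolerated_failures


def service_survives(sites: list[Site], standby_is_idle: bool) -> bool:
    """Return True if the overall service stays up.

    Assumption: with an idle standby, one healthy site is enough because
    the workload can fail over to it. When every site carries production
    load, no spare capacity remains, so every site must be healthy.
    """
    if standby_is_idle:
        return any(site.healthy() for site in sites)
    return all(site.healthy() for site in sites)


# Three failed disks at one site, assuming each site tolerates two.
primary = Site("primary", failed_disks=3, tolerated_failures=2)
secondary = Site("secondary", failed_disks=0, tolerated_failures=2)

print(service_survives([primary, secondary], standby_is_idle=True))   # True
print(service_survives([primary, secondary], standby_is_idle=False))  # False
```

In this simplified model the same disk failures are survivable while the second site is held in reserve, and become an outage once both sites are carrying production work.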
Gonzalez called it “a series of events that the department has not previously witnessed.” But she and several of her counterparts nationwide said it’s one for which they are constantly vigilant.
As part of the agency’s disaster recovery process, staffers worked around the clock to bring the DMV’s system back online, rebuilding and reprogramming 190 processors from scratch. And to hasten service availability, offices that run on dual processors (because they process so many transactions) were restarted on single processors first.
While agencies in many other states have so far avoided a widespread outage, they continue to plan for that possibility. Take Virginia, where Department of Motor Vehicles CIO David Burhop said the department has had situations where an individual service such as license renewals has gone down, but nothing that took a spectrum of services offline for several days.
The state backs up its 90-plus departments through Northrop Grumman, and maintains separate primary and backup systems for its data, Burhop said, housing the primary system south of Richmond and the backup system in the southwest part of the state.
“They definitely wanted a separate power grid, weather zone, and so they went pretty far out and away from the shore in terms of hurricanes, etc.,” Burhop said, noting that the Virginia Information Technologies Agency acts as a liaison between Northrop Grumman and executive branches of the agency.
Hardware includes servers from Dell and EMC, and an IBM mainframe.
“Knock on wood, they are some of the most reliable pieces of hardware that I have ever worked with,” Burhop said of the mainframe. “The great thing is, your average hacker today — when’s the last time you ever heard, or have you ever heard of a mainframe getting hacked?”
To make sure departments are prepared for a service disruption, the commonwealth conducts an annual drill to test all system alerts and all notification trees as if a disaster had just occurred. The most recent edition was about a month ago, Burhop said, over a period of three to four days.
The test began at around 3 p.m. on a workday, and the DMV — which isn’t a “Priority One” department — was fully “restored” by the end of the following day. And an actual customer disruption, of course, never happened — in large part because customers have made it abundantly clear they like to be able to access their DMV at all times.
Testing also is commonplace at the Texas Department of Motor Vehicles; components are tested annually, and portions of the system are subject to random testing — the so-called “point of sale” systems at county offices coming up for review next year.
Recently, said CIO Eric Obermier, more rigorous testing requirements have made departments that don’t meet data restoration timelines subject to retesting in 90 days.
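A retest rule of this kind is simple to express in code. The sketch below is an assumption-laden illustration, not the Texas policy itself: the function name `next_test_date`, the hour-based restoration target and the reading of "in 90 days" as "90 days after the failed test" are all hypothetical. The annual interval comes from the yearly component testing described above.

```python
from datetime import date, timedelta

RETEST_WINDOW = timedelta(days=90)      # retest deadline after a missed target
REGULAR_INTERVAL = timedelta(days=365)  # annual testing, as described above


def next_test_date(test_date: date, restore_hours: float,
                   target_hours: float) -> date:
    """Return when the department should next be tested.

    Assumption: a test fails when the measured restoration time exceeds
    the target, and a failed test triggers a retest 90 days later.
    """
    if restore_hours > target_hours:
        return test_date + RETEST_WINDOW
    return test_date + REGULAR_INTERVAL


# A restore that took 30 hours against a hypothetical 24-hour target.
print(next_test_date(date(2023, 1, 10), restore_hours=30, target_hours=24))
# 2023-04-10

# A restore that met the target goes back on the annual schedule.
print(next_test_date(date(2023, 1, 10), restore_hours=20, target_hours=24))
# 2024-01-10
```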
Almost a year ago, the department, which hasn’t experienced a widespread outage either, upgraded its system from running Natural and Adabas to Linux and DB2, in part because techs who speak the former are getting hard to find.
In Illinois, where the DMV’s data backup and recovery is handled by the Secretary of State’s office, a test data recovery is conducted every year, typically in the spring, said Data Systems Administrator Gary Dameron. During this time, the office tests the DMV’s enterprise mainframe and rotates testing of different distributed systems.
“Testing is key,” Dameron said. “Testing of not only your DR [disaster recovery] solution but of your backups. Even though we’ve been doing this for a lot of years, every year we test, we find something new.”
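One common way to test a backup, rather than just the recovery plan, is to restore it to a scratch location and compare every file against the original by checksum. The following is a minimal sketch of that idea, not a description of how Illinois does it; the function names `manifest` and `verify_restore` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(original: Path, restored: Path) -> list[str]:
    """List files that are missing or altered in the restored copy.

    An empty list means every original file came back byte for byte.
    """
    expected = manifest(original)
    actual = manifest(restored)
    problems = []
    for name, checksum in expected.items():
        if name not in actual:
            problems.append(f"missing: {name}")
        elif actual[name] != checksum:
            problems.append(f"corrupt: {name}")
    return problems
```

Calling `verify_restore(Path("live_data"), Path("restore_test"))`, with both directory names being placeholders, returns an empty list only when the restored copy matches the source exactly. A check like this catches the backup that completes without error but cannot actually be read back.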
Dameron said the DMV has facility servers in the field that communicate back to its main data center, the Storage Area Network. That’s its primary storage, but its backup is a newer virtual tape system — storage disks that emulate tape.
Obermier also noted that maintaining multiple power sources for hard drives is a key to data preservation and recovery. That way, he said, “the failure of any one of them is not linked to the other.”
In Oregon, DMV spokesman David House said that smaller outages happen regularly, but are almost always fixed within “a couple hours.”
“It’s usually a telecommunications issue, somebody cuts a fiber cable somewhere,” he said.
Oregon DMV backs up its data daily through a mirror agreement with another state, though House declined to say which one.
“If we had a complete loss like the system caught fire, we can just copy it back over from this state,” he said, noting that he doesn’t know how long that would take because it’s never happened before.
The department has been computer reliant since the late ’60s, and its core systems are COBOL-based with modern hardware. The Oregon DMV has had a disaster recovery plan in place for “a decade or two,” House said, and updates it every year or two. The plan covers everything from disaster recovery to staff contacts, and getting systems back up and running.
Recently the department simulated a disaster recovery in conjunction with The Great Oregon ShakeOut, a local example of the multinational earthquake preparedness drill.
Informed of the extent of California DMV’s outage, House called it “an extremely rare and unlikely event.”
“That’s hard to plan for,” he said. “They’ll probably learn from that.”
On the whole, Texas DMV’s Obermier likened information recovery and disaster recovery to buying insurance, and said both are typically difficult to fund until an actual crisis hits.
“People don’t like insurance until they need it and they’re more likely to go without it until they need it, and then they’ll never question it again,” he said. “In the state of California, it’s probably not going to be very hard to go and get money to make that system more resilient.”