William Leonard
Hey everyone. Welcome back to the Atlanta Startup podcast. My name is William Leonard and I’m your co-host. Today, we have a fun and informative conversation for you all between Valor Ventures’ general partner, Robin Bienfait, and one of our newest portfolio CEOs, Doug Neumann. By hearing these two talk, I am so confident that you will learn the ins and outs of building more resilient cloud infrastructure for your respective company from two of the most well-informed minds in the industry. A little bit more context on Doug. He leads Arpio, which is a startup that I actually discovered back in the spring of 2021. I saw the company on LinkedIn, reached out to Doug, we had a great first conversation, and ultimately Valor ended up leading their seed round. Since then, Robin has joined the board of Arpio and the company is legitimately off to the races. We’re excited about the solutions that Doug is bringing to market and happy he’s here with us on the podcast today.
Robin Bienfait
Thank you for joining me. This is our webinar with Arpio, with Valor Ventures. I’m a general partner of Valor Ventures and our webinar today is Defining Disaster Resilience Infrastructure for Web 3.0. It’s really about resilience. So Doug, as an expert in the space we’re talking about, give us your background and intro, and then I’ll share mine. Why don’t we get started with you?
Doug Neumann
Yeah, sure. My name is Doug. I’m one of the founders and the CEO of Arpio. Arpio is a resilience company for the AWS cloud. We have services that help companies ensure that they are never down for long when they go down. My background is that I’ve been a software engineer forever. For longer than I’d like to admit. I’ve worked at big companies like Microsoft, at startups before, and I’ve dealt with outages across the spectrum. In particular, at the last company, I ran a software engineering organization for telecom, and we had a lot of stuff in AWS. And Amazon had a five-hour outage in the middle of the day one time. It was a very uncomfortable experience. It was very painful. Yeah, but that actually got us into Arpio. That really was the genesis of this company.
Robin Bienfait
You have a passion for this space. I also have a background in telecom. I was at AT&T and Bell Laboratories, went to BlackBerry, and then worked at Samsung. This is a sweet spot for me because I am passionate not only about being resilient and understanding the disaster recovery landscape, but also about business continuity as we move forward with new technologies. Some of the things that I’ve been part of were, of course, 9/11. Katrina was a big one. But some of the ones that are hard, and what a lot of people don’t see, is how these affect the client and customer. As we move to the cloud and more businesses and services and critical needs are concentrated in the cloud, it becomes even more important to understand what we are doing about being resilient and having business continuity in the new cloud environment.
I was reading somewhere that 40% of critical services have already moved their main run and operation to AWS. I wanted to talk about outages today because I think that’s top of mind for both of us.
Doug Neumann
Sure. Let’s do it.
Robin Bienfait
So which one comes to mind for you first, that you guys have been working on?
Doug Neumann
Right. Well, I’ve got a few stories in the can that I can share with you. The first one is a story about a company called Code Spaces. Code Spaces was, at one time, an up-and-coming SaaS business, a competitor to companies like GitHub and Bitbucket. They had a boutique service offering source control management solutions for highly regulated organizations. They had all kinds of marketing material about their business continuity posture and how elite they really were in this. They woke up one day and found that somebody had hacked into their AWS account, and that person had notified them that they were on the inside and demanded a ransom to leave. So they said, well, let’s just lock them out. They began the process of shutting down the attacker’s access to the account. Unfortunately, they weren’t fast enough. The attacker had installed a back door and used that back door to go delete everything in the account. All the servers, all the databases, and most importantly, all of the backups of all those systems.
Robin Bienfait
And they probably started with the backups. Didn’t they?
Doug Neumann
I don’t know that level of detail on it. What I know is that three days later they had to close down their business. You take a company that’s worth tens, maybe hundreds of millions of dollars, and imagine that evaporating. Because they just hadn’t done the right things to protect themselves from all the various disaster scenarios.
Robin Bienfait
And what do you actually do when you see things like that? This is not the only one, I can imagine. There are probably multiple examples in this space. I used to get customers calling in, crying on the phone, “My business is down,” and you’re kind of thinking to yourself, so what was your backup strategy? What was your continuity strategy? You can’t get into the middle of that when they’re seeing their business close before their eyes. It’s a pretty stressful situation.
Doug Neumann
Yeah. It is totally catastrophic. The reason why this story has always stood out to me is that so many of the people that we talk to are focused on platform resilience and how do I protect my business in case my cloud provider has an outage. But the most existential threats to companies are cyber events. Whether it’s what I just described there, a bad actor getting into your environment, or a ransomware attack. There’s another story that I often tell people. It’s about an event that happened at Cisco in the WebEx team, where they had actually parted ways with an employee. I don’t know the circumstances there, but . . .
Robin Bienfait
A lot of the employees are insiders at one point in time. They know all about everything that’s going on, right?
Doug Neumann
Yeah. They’d failed to shut down this guy’s access apparently to the AWS environment.
Robin Bienfait
You wouldn’t believe how often that happens.
Doug Neumann
Yeah. And especially if you think about how complex these platforms are. You can do all kinds of stuff around single sign-on to make sure that the front door has a single point of entry and that you’ve done the right things to lock down access for those people and eliminate access as people are leaving an organization, but there are all kinds of back doors, API tokens, and other sorts of mechanisms. This guy took advantage of that to go in and delete 456 virtual machines out of the production environment. They had a two-week outage of the Webex Teams application. They luckily were able to restore the service two weeks later. It’s better than Code Spaces was able to do with theirs. But if you just say . . .
Robin Bienfait
Yeah, but how much revenue did they lose during that period?
Doug Neumann
Revenue loss? Yeah, totally. And the value of that business. You think about it: there are people in the organization, single points, who have the ability to get into your environment and completely undermine all the IT systems in there.
Robin Bienfait
Well, and one of the stories I shared with you is, we were just doing an upgrade to one of the services a long time ago, and it was all the 800 services. Somebody literally in their protocol deleted the whole infrastructure. Now the telephony continued to work, but we had no record of who owned those 800 numbers for a long period of time. And so you can’t bill them. There’s no billing that occurs. A lot of people don’t understand that, once you’ve made that mistake unless you have some way to build it back, have that resilience to restore it, you’re just left empty-handed.
Doug Neumann
Yeah, you are! I think another thing to think about as people are transitioning from traditional on-premises workloads into the cloud is that cloud workloads are entirely virtual. On-premises, it’s generally a mixed lot of virtualization. But the cloud platform is one unified platform, whereas on-premises you’re running different hypervisor platforms, and different vendors are providing networking platforms. What that means is that there is some amount of isolation you get just because of the fragmentation on-premises. You move into the cloud and you now have one credential that gives you access to do all kinds of things. It does become a big vulnerability for the resilience that you can provide, especially when we’re talking about cyber resilience.
Robin Bienfait
For a lot of people, even if they still have those on-prem solutions, the cloud offers them an alternative space for continuity. They can actually build and structure an environment in the cloud in the event that they have to recover and not have to do an on-prem recovery if that’s not possible.
Doug Neumann
Yeah, sure. There are a lot of solutions that help people do that. It’s still a heavy lift, but we see a lot of companies that are trying to figure out how do I not have to pay for a second data center, hundreds of miles away, that I hope I never have to use.
Robin Bienfait
And have a true cloud as an option.
Doug Neumann
Yep. But we’ve been talking about these cyber disasters; the other class of disasters you see in the cloud is these platform outages. And before we come back to what happened last month at AWS, and talk about that, these happen. These outages do happen a few times a year. They’ll have a major outage, in AWS in particular. The other story I was going to tell you about was one that happened just over a year ago. It was the day before Thanksgiving. Amazon was scaling up some services to be able to support the Black Friday shopping that was going to happen. By adding capacity to one of the services, they triggered scalability limits in that service. That caused that particular service to go down. It took them 17 hours to recover from this outage. The interesting thing is this service is called Kinesis. It’s a data streaming service. It’s foundational for all kinds of other services in AWS. Not only is Kinesis affected, but the CloudWatch metrics service is affected. When metrics aren’t working, then things like auto-scaling aren’t working. One scalability bug, effectively, in one service, suddenly creates a massive outage for thousands, maybe millions, of businesses. There’s nothing they can do but sit down and wait for Amazon to fix it.
Robin Bienfait
And at that point in time, they probably already know for every minute that they’re not available, how much revenue they lose.
Doug Neumann
Some of them certainly do. Yeah. I think that, in general, the lesson that I often take out of that one is twofold. One is the cloud is so powerful, but so complex, that you don’t understand how the butterfly flapping its wings over here might generate a cyclone over there kind of thing. Really, I think the people, even on the inside, the cloud environmentalists, don’t understand how all these dependencies stack up. The other thing is that these cloud platforms are so massive. There’s no way to scale test them before you go into production. We see, with new releases or scale-up events, that it’s the first time that the providers will realize that they haven’t built a solution that actually works for the scale that they’re being driven at.
Robin Bienfait
Do you think that these types of outages affect people that are only leveraging high availability?
Doug Neumann
Well, yeah, certainly there are different classes of failure modes, and the Kinesis one, I think, definitely impacted companies who have more cloud-native, cloud-forward workloads. The irony of this is if you just take some virtual machines out of your data center and plop them down in the cloud, and it’s a very static environment for the most part, then you aren’t vulnerable to that Kinesis outage. You aren’t actually vulnerable to, well, one of the three outages that happened last month in AWS, the big one. But once you start really taking advantage of these higher-level services and these auto-scaling capabilities and anything that has to interact with the dynamic capabilities of the platform, then you’re vulnerable to outages of the platform. That is where you find companies having these tear-your-hair-out events periodically.
Robin Bienfait
Just a question, because maybe I’m not as cloud-native as some folks on the call. When you’re dealing with Amazon Web Services and they’re having an outage such as this, is their only action afterward to kind of give you a credit back on your account for the 17 hours of activity that you lost, but they don’t actually help you with the business impact that you just took?
Doug Neumann
Well, yeah, so certainly they have SLAs and they do have some financial penalties associated with those with really large customers. It’s pennies compared to the actual loss of business that happens during these events. Effectively, they’re going to refund you for the service that they weren’t able to offer. But they’re not going to compensate you for the business that you were not able to transact during that period. That is very much on you. And honestly, they tell you, you should have built a more resilient workload. We never promised that outages would never happen. And we gave you techniques to leverage. Had you leveraged all of those techniques, then maybe you wouldn’t have been impacted by their failure.
Robin Bienfait
Let’s talk about what really happened at AWS last month. There was headline news from the Washington Post that we had three AWS events (I call them events, but they were outages). So, you know, that’s pretty significant. The question is, are we on the rise for more of this because of all the new capabilities and the change and the number of people that are now participating in this environment? Are we going to start seeing more outages, you think?
Doug Neumann
Well, I don’t think we’ll see fewer. Certainly, as they grow these cloud platforms, they become more complex. As workloads move into the cloud, they grow to be larger in scale, and complexity and scale beget problems, unfortunately. So outages are not going away. There are a lot of people talking in the aftermath of last month about how we as an industry step up to build more robust and more resilient workloads, to take the things that we are running in the cloud and figure out how to enhance them, so that the next time Amazon has a seven-hour outage, you might be able to restore your service within a few minutes.
Robin Bienfait
When people think of high availability, they think they’ve got resilience or disaster recovery capability. And it’s really not. Even though you’re in cloud infrastructure and you’re thinking there’s some replication in there that’s just natural, it really isn’t a disaster recovery strategy.
Doug Neumann
Yeah. HA and DR are different problems. And you really need to solve both. Some level of HA can mitigate some of the risks that DR might also mitigate, but no high availability solution is going to help Cisco recover from a disgruntled ex-employee deleting 456 servers. You still have to invest in your ability to recover from those things, to get your service back online quickly. And HA seems to fail at the wrong time. There were plenty of companies that thought that they had a high availability solution in place when Amazon lost a data center for a few hours last month but then figured out that there was a single point of failure within their workload. They learned about it during the outage. So, DR is certainly still an investment that is just as relevant as it has always been.
Robin Bienfait
So, how does Arpio help with this? How do you and your team help? I know you’re passionate about this. I’m passionate about this, too, because I’ve seen in so many companies that all it takes is one outage and they’ve lost their clientele. They’ve lost a big chunk of their business and/or confidence. The customer confidence has failed. You know, how are you helping with that? Where is it that you come in to help solve this problem?
Doug Neumann
So, we go back to the fact that there are two classes of disaster scenarios. There are platform outages where your cloud platform is down, or, worst-case scenario, it could disappear for the long term in certain black swan kinds of scenarios. The other is these cyber events. First, the way that you protect yourself or make sure that you can recover in the face of any cloud outage is to be able to run your workload in a different part of the cloud that’s not impacted by the outage. AWS is made up of 24 distinct regions around the world, and each of those regions is architected to be completely independent so that an outage in one should never cascade to another. What we do for that scenario is we can actually forklift a running AWS workload and move it into a different part of the cloud, where it can continue operating when it’s impaired in the primary environment. We can do that in minutes. So you’d have some downtime, but you’re back up and running within minutes, as opposed to waiting hours, maybe days, for the provider to fix the problem.
Robin Bienfait
That gives you the resilience that you really need to keep your business running.
Doug Neumann
It does. Yes. And fundamentally, companies need this. The question is, are you going to go build this yourself? Or are you going to leverage a third-party solution? It’s really not strategic for most organizations to spend months, maybe years, of their engineering time figuring out how to institute a multi-region redundancy strategy. What we do with Arpio is say, “Well, we can turn that on for you within minutes. You can go back to making the investments that actually increase revenue and make your customers happy.”
Robin Bienfait
I like that because you’re bringing in an expert that focuses on this area that some people don’t think of as a competitive advantage, but it is because what you’re offering your client is that peace of mind that if there is an event, their business isn’t impacted. And if it is, it’s minimal.
Doug Neumann
Yes. And we talked about that in the context of cloud outages, but the same thing applies to cyber risks. Like if you get a ransomware attack like Kronos did. Ultimate Kronos Group is a big payroll provider. I think I read that they actually finally have their core services back. But they had a ransomware attack in the middle of December. On January 22nd, they finally had their core services restored, which means 8 million employees in the country were not getting their paychecks through their payroll system. I’m not sure how their employers were handling this stuff, but that’s pretty devastating. So those ransomware events, you have to be prepared for. To recover from them, you have to make sure that you can restore your data and the entire environment that surrounds it in an isolated, non-compromised, non-infected environment. That’s the other thing that we do. We can take that workload and roll it back to a previous point in time, before the ransomware event happened, and deploy that into a different security realm, which has not been compromised as part of the cyberattack, and get the service back online again within minutes.
Robin Bienfait
Which gives you the option of not having to pay the ransom.
Doug Neumann
Of course! Yeah. Saves a lot of money there.
Robin Bienfait
I wonder if they paid the ransom. Did they pay the ransom in this scenario?
Doug Neumann
I don’t know. I have not read that.
Robin Bienfait
I remember the Colonial Pipeline issue. They paid for it and they still didn’t recover fast enough.
Doug Neumann
Yeah. I think oftentimes people pay the ransom and then find that the recovery technology that they’ve been given doesn’t actually work. All that said, I’m not sure of the details of that Kronos scenario, but it’s just top of mind as these things have been happening over the past month, and just how catastrophic these events can be. And you talk about reputational damage. I can imagine a lot of people who use the Kronos service and have not been able to use it for the past month are choosing not to renew those subscriptions.
Robin Bienfait
Anything you want to leave with the people that are watching our webinar today? Any thoughts that you want to leave them with?
Doug Neumann
I guess I’d just say, don’t forget that disaster recovery is just as relevant in the cloud as it has been on-premises. Whether it’s an outage of your cloud provider or protecting yourself from cyber events, a lot of people like to focus on how do I keep the bad guys out, but the bad guys keep getting in. When they get in, your recovery capability is that last line of defense, the most important piece of it. You have to consider both of these things as you’re running critical workloads that underpin the value of billion-dollar enterprises every single day. And DR is just a small investment in an insurance policy to make sure that your business will always be able to survive any kind of unfortunate event.
Robin Bienfait
I think of it as also keeping your commitment to your customers, to be always available.
Doug Neumann
Exactly. Yep.
Robin Bienfait
All right. Well, thank you very much for sharing all of this with us today. I’m so excited to be part of this journey with you. If there’s anybody out there that wants to connect with us, just give us a shout and we will connect with you either online or I can get you to Doug if needed. Excellent. Doug, it’s been wonderful chatting with you and sharing stories, and thank you very much for your time.
Doug Neumann
Thanks for having me, Robin. Really appreciate it.
William Leonard
Hey everyone, it’s William. Thanks for tuning in. I hope you enjoyed that conversation. And, if you did, I would love to invite you to come to the live conversations as we are recording them. This talk from Robin and Doug was part of a webinar series we have called Valor Visionaries, and you can sign up on our events page at valor.vc/events/. So, if you enjoyed it, drop me a note this week at the Atlanta Startup podcast on LinkedIn, and let us know what you thought about the episode.