Experience Series: Operations
This post is part of my Experience Series, where I share opinions I have formed from building companies and running product. Each post is a breakdown of a company function. This one focuses on operations.
No one ever plans for systems to stop working. Ensure you have backups and that they are tested on a regular basis.
- About a month after launching PassiveTotal, our primary database node ran out of disk space, causing everything to halt. In a panic, I was on the command line trying to free up space, only to make a typo and accidentally remove a portion of our journal file. Database, corrupted. Backups, none. It was a complete failure that could have been avoided in so many ways. Fortunately, we were barely off the ground and most people were understanding.
Back up locally and to a cloud-based service that supports strong, multi-factor authentication.
- After the database deletion with PassiveTotal, I went above and beyond on backups so that we’d never have an issue like that again. Cron jobs were set to back up key portions of the database every hour, every day and every week. Each backup was kept encrypted locally on the machine and then securely posted to a cloud service with 2FA enabled.
- With PassiveTotal, we never ran into a major issue like we did the first time, but we did encounter cases where database nodes would fall out of sync, causing data inconsistency. Having snapshots from multiple points in time and from multiple databases was helpful in piecing together the proper database.
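A backup only counts once it has been restored successfully. Here is a minimal sketch of that idea, assuming the data to protect is a directory of files: it archives the directory, restores the archive into a scratch location and compares checksums. The function names are hypothetical, and encryption and the cloud upload are left out. For a live database, the comparison step would more likely be a restore followed by a sanity query, since the source keeps changing.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            digests[str(path.relative_to(root))] = hashlib.sha256(data).hexdigest()
    return digests


def create_backup(source_dir: Path, archive_path: Path) -> None:
    """Write a gzip-compressed tar archive containing every file in source_dir."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                archive.add(path, arcname=str(path.relative_to(source_dir)))


def verify_backup(source_dir: Path, archive_path: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to the source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)
```

Running `verify_backup` from the same cron job that calls `create_backup`, and alerting when it returns `False`, is one way to make "tested on a regular basis" automatic.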
General operations rule of thumb: bad things happen at the worst possible times, so just get used to it. Always have a means to fix a problem, no matter how remote you are, or better yet, train someone else.
- I was flying to Memphis in order to see Steve when one of the PassiveTotal application front-ends failed. I had just landed in Chicago and needed to make my connection, but also try and fix the problem. I managed to SSH into the server from my phone and work some magic to get the service back up until I could properly troubleshoot.
- Attending a conference on the beach, I ran into one of my NinjaJobs customers who jokingly said I should fix my service if I am going to be out on the beach. Turns out I had introduced a bug in my last commit, and it was impacting their job posting. I ran back to my room, identified the issue and pushed the fix to production.
Use local or 3rd-party services to track the health of your product and company resources. If your customer is your way of finding out something is wrong, you are in a bad position.
- Even with numerous services monitoring your application, this can still be tough to get right. To this day, I will still occasionally have a customer or friendly user let me know something is off with one of my products, only to find that a monitor had a gap or an alert was routed to the wrong location. It’s rare, but it still happens. Keeping your services healthy 24/7 is a tough business, especially when you continue to grow.
- For PassiveTotal and NinjaJobs, I liked using Datadog to track the health of our services. The agent would look at general host performance––network, disk, IO, memory––and application performance––key services running, processes in place, etc. All alerts were set to trigger on Slack, email and via text message.
- For Backscatter.io, I wanted to take a more manual approach in order to save money, so I went with Nagios. It’s a lot more work to manage your own local monitoring, but if you intend to spin up hundreds of servers, it ends up being worth it for cost savings.
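To illustrate the kind of host check a monitoring agent performs, here is a minimal sketch of a disk-usage monitor. It is not how Datadog or Nagios implement their checks. The function names, the threshold and the `alert` callback (which could post to Slack, email or SMS) are all assumptions for illustration.

```python
import shutil
from typing import Callable


def disk_used_percent(path: str = "/") -> float:
    """Return how full the filesystem containing path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path: str, warn_at: float, alert: Callable[[str], None]) -> bool:
    """Return True when usage is below warn_at; otherwise send an alert and return False."""
    used = disk_used_percent(path)
    if used >= warn_at:
        alert(f"Disk usage on {path} is {used:.1f}% (threshold {warn_at:.0f}%)")
        return False
    return True
```

Something like `check_disk("/", 85.0, print)` could run from cron every few minutes. A check this simple would have flagged a database node filling up long before it ran out of space.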
Testing your application is a pain. Consider using 3rd-party services to automate integration and functional tests in order to save time and catch bugs before your customers do.
- To this day, I am poor at writing great test harnesses. In fact, if I am not making money on a project, I will seldom write them. For the 95% of the time when no issues are caused, I feel great, but it’s the 5% of the time when a bug breaks the application that really makes me wish I had those tests.
- With PassiveTotal, we didn’t introduce much in the form of tests until after RiskIQ purchased the company. Immediately, we put integration test coverage in place with GhostInspector. This was helpful for ensuring that key processes like login, registration, running a search and getting to settings all functioned properly.
- At RiskIQ, we eventually rewrote PassiveTotal and can now happily say that the platform has great test coverage––no thanks to me, but still a big win.
- Once I saw the magic of integration testing within RiskIQ, I ported that same logic over to NinjaJobs and automated the same key process analysis. This suite of tests has saved me over and over from introducing a breaking bug into the platform.
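A hosted service drives a real browser, but the core idea of checking key pages after every deploy can be sketched in a few lines. This is a simplified illustration, not what GhostInspector does. The `fetch` callable, the paths and the marker strings are hypothetical, and in practice `fetch` would wrap an HTTP client pointed at your application.

```python
from typing import Callable, Dict, List, Tuple

# fetch takes a path and returns (HTTP status code, response body).
Fetch = Callable[[str], Tuple[int, str]]


def run_smoke_checks(fetch: Fetch, expected: Dict[str, str]) -> List[str]:
    """Request each path and report any that fail or lack their marker text."""
    failures = []
    for path, marker in expected.items():
        status, body = fetch(path)
        if status != 200:
            failures.append(f"{path}: HTTP {status}")
        elif marker not in body:
            failures.append(f"{path}: missing {marker!r}")
    return failures
```

An empty list means every key page loaded and contained the text it should. Anything else is a reason to stop the release.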
Assuming you are working with more than one person, have a means to communicate with each other. Ideally, send your first-pass alerts there and fall back to email.
- At the smaller end (less than 10 people), I think Slack is the best fit for these communications. The cross-platform nature is great and they integrate with nearly every service. Slack was invaluable when Steve and I ran PassiveTotal.
- As a company grows and more people are added, it’s easy to ignore a Slack alert or filter out email. Phone calls can be ignored too, but are more effective alerts when an outage takes place. When I worked at Facebook, they had a system that would escalate calls from you (the committer) all the way to the CEO as long as no one answered their phone. Someone always answered before that.
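The "first-pass alerts to chat, fall back to email" pattern, and escalation chains in general, come down to trying an ordered list until something succeeds. Here is a minimal sketch with hypothetical names. Each channel is a function that returns `True` when the alert was delivered or acknowledged. It does not represent how Slack or Facebook's escalation system actually work.

```python
from typing import Callable, Optional, Sequence, Tuple

# A channel is a (name, send) pair; send returns True on success.
Channel = Tuple[str, Callable[[str], bool]]


def notify_with_fallback(channels: Sequence[Channel], message: str) -> Optional[str]:
    """Try each channel in order and return the name of the first that succeeds."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # A broken channel should not stop us from trying the next one.
            continue
    return None
```

A return value of `None` means nobody was reached, which is itself worth treating as an incident.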
Being able to tear down and stand up your services at will with no business impact is as glorious as it sounds. It takes a lot of work, but it’s worth it when you run into issues later on. Notice I said “when” and not “if”.
- When I first wrote PassiveTotal, I had no idea all the ways you could automate system deployments. We started with a single node that ran the whole company and slowly branched out to have multiple web frontends, multiple backend database nodes and tons of services that were called by the primary application. Anytime we had issues, I was rebuilding by hand with a checklist, hoping I got everything right. Fortunately, we never hit any major snags due to this, but man, it was a lot of time wasted.
- In building Blockade.io, I wanted to try building an application that existed completely in the cloud with no true server presence. Completing that project was magical: the setup eventually became completely automated using AWS APIs. There’s nothing cooler than pressing a button and watching an application manifest itself from nothing but a configuration file.
- For Backscatter.io, I went one step further in my configuration building. Aside from automating the provisioning of machines, I also automated the installation of the service and logging components using OS packages. Install steps normally done by hand were all automated with tests inside of a Debian package. Automating the provisioning process is one of the most satisfying feelings.
Logging should be baked into your application, but ensure those logs flow to a central location that you can query.
- When I initially wrote PassiveTotal, logging was baked into the application through simple statements. This was better than nothing, but not highly effective for measuring application performance or troubleshooting at scale. Over time, we migrated from internal logging to a combination of open source performance monitors and paid services.
- As a general security tip, be sure to avoid leaking sensitive information within your logging (think passwords, usernames, tokens, etc.). In a development environment, you may want to have these, but they should never make it into production.
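One way to enforce the tip above is to scrub log records before they are written. This is a minimal sketch using Python's standard `logging` module. The `key=value` message format and the list of sensitive keys are assumptions, so adjust the pattern to whatever your application actually logs.

```python
import logging
import re

# Assumed format: sensitive values appear in log messages as key=value pairs.
SENSITIVE = re.compile(r"(?i)\b(password|username|token)=(\S+)")


class RedactingFilter(logging.Filter):
    """Replace the values of sensitive keys before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so %-style arguments are covered too.
        message = record.getMessage()
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", message)
        record.args = ()
        return True  # keep the record, just with scrubbed content
```

Attaching the filter to a handler with `handler.addFilter(RedactingFilter())` scrubs everything that handler writes, including messages from libraries you did not author.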
Split your corporate website and status pages from your application. If your product needs to go down for updates, it should not knock your business offline.
- All of my older applications used to bake the corporate website within the application itself. While convenient, it was annoying to manage and made it difficult to communicate to users during maintenance periods. These days, I favor hosting the primary website in something like S3, using an “app” subdomain for the core application and outsourcing the status page to a paid service.
Patch your machines, apply updates, do all the normal work we all love to heap onto everyone else. You start to get a sense for how annoying this is when you update and break your production build.
- Managing PassiveTotal brought on a good dose of security medicine. I can’t count how many times I just decided to update our server without checking versions, only to find the application broke or specific logic no longer functioned like it did before. After breaking production several times during patch cycles, we instituted a process to stand up development machines, perform patching tests and then roll out during planned maintenance periods.
- Container technology has come a long way since some of my early projects. If I were to build anything today, I would implement the use of containers just to maintain consistency in my deployment process.
If people are paying you for a service, chances are high you agreed to some SLAs. Don’t breach your contracts; keep your systems up and running.
- As the stories above indicate, PassiveTotal went down a number of times in our early days. Most of our customers were understanding and didn’t bother to use this against us. As the product has grown within RiskIQ, it’s a different story. There’s an expectation of maturity and of service levels being met. When they aren’t, you can expect a customer to request a discount or a credit to their account for the lapse in delivery. It’s annoying, as no system is ever perfect, but it’s a reality you need to live with.
Implement measuring tools inside of your application stack to identify hot spots or areas that could be further optimized. Establish a baseline and chip away at the performance updates when you are waiting on idea feedback.
- In the early days of PassiveTotal and NinjaJobs, we didn’t have any great application performance monitoring. As the services grew over time, both in data and users, these monitoring services became invaluable. Bad code that worked for a handful of users would eventually fall apart at scale, resulting in a degradation of performance or an outright failure despite no code changes. Using application performance monitoring not only gave us insight into why a failure took place, but also let us become more proactive by addressing hotspots before they became failures.
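Before adopting a full monitoring product, a baseline can be as simple as timing the functions you suspect. Here is a minimal sketch of that idea. The `timed` decorator, the in-memory `timings` store and `slowest` are hypothetical helpers, not part of any monitoring service mentioned here.

```python
import functools
import time
from collections import defaultdict

# Function name -> list of observed durations in seconds.
timings = defaultdict(list)


def timed(func):
    """Record how long each call to func takes, even when it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


def slowest(n=3):
    """Return the names of the n functions with the highest average duration."""
    averages = {name: sum(v) / len(v) for name, v in timings.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]
```

Decorating request handlers and database calls with `@timed` and reviewing `slowest()` periodically gives a rough list of hot spots to chip away at.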
If you’re constantly pushing out new features or changes to your product, invest in a fully replicated development environment.
- PassiveTotal has grown in complexity since RiskIQ purchased the company. These days, changes are being made to PassiveTotal every day, with pushes sometimes happening several times a week. Testing in production isn’t an option. In order to ensure builds are being tested beyond what’s been automated, a separate development environment has been stood up. This staging environment lets internal users “dogfood” the application prior to a production release and also offers up a good test bed for our provisioning processes.