I lead a team that manages a service organization, supporting multiple teams reliant on infrastructure and database services. In my opinion, reducing time to delivery for new features and functionality is the largest business challenge that we face in my industry. Improving operational stability and consistency is another challenge that we face as data warehouse professionals, along with reducing the risk of unexpected delays when rolling out infrastructure changes (both hardware and software).
Operations teams have historically tended to be conservative and slow moving. While these traits may help to reduce risk, they typically come at the cost of speed to delivery for new features, hardware, software upgrades, and support for new products. By contrast, at TripAdvisor we have a “speed wins” culture that prioritizes getting quality work done quickly. By investing in database and data-warehouse initialization, configuration, and deployment as first-class “infrastructure as code,” we allow our data service organization to scale at the speed of the business without sacrificing operational rigor or standards.
The availability of cloud providers gives individual teams the opportunity to provision their own infrastructure on demand. While these services offer tremendous benefits in elasticity and reduced spin-up time, there is some risk of creating silos across teams that manage their own infrastructure. Initial time to market is reduced, but not all teams are going to have the same level of operational rigor supporting cloud services. It is likely that teams that go around their respective operations organizations will need to pay similar costs in terms of automation once they reach a certain scale. A hybrid solution that combines cloud-provided and on-premises infrastructure, run in partnership with internal operations, is likely to provide the best outcome in terms of speed, security, and service to the larger product organization. In short, cloud can be a part of the solution, but without changes in how we think of our infrastructure internally, it is not by itself a solution to all of these challenges.
Traditional database services organizations are often slow because they rely on operations personnel to accomplish tasks through manual tweaking and configuration. At TripAdvisor, our investment in automation and “infrastructure as code” allows our service organization to perform software and hardware upgrades more effectively, without incurring delays or impeding the business. We use multiple open source configuration management tools to achieve these goals. More specifically, we use Puppet for underlying OS configuration and Ansible for higher-level application service configuration. This investment was born out of the need to manage ever-increasing amounts of infrastructure without a comparable increase in staffing. By focusing initially on some quick wins, we were able to show the value of this approach while iterating towards more comprehensive solutions.
From an operational rigor perspective, the benefits are clear. With infrastructure state checked into source control and deployed automatically, we eliminate systematic differences between individual components of a system. Large-scale changes in configuration can be rolled out in a managed and controlled fashion to a subset of systems. Performance signatures can be compared between different states. New systems can be brought online in minutes rather than days, which lets us, as a service-oriented organization, exceed our customers’ expectations in terms of time to delivery. The benefits are most immediately apparent during large-scale hardware refresh cycles: we have reduced the time to initialize and bring into service a datacenter refresh from weeks to days. The benefit is also visible on a small scale, with individual new product-facing requests.
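To make the idea of checked-in state and controlled rollout concrete, here is a minimal sketch in Python. It is illustrative only and is not how Puppet or Ansible work internally, nor a description of our actual tooling. The function names, the setting names, and the host names are all hypothetical. It assumes the desired configuration is available as a plain dictionary (as it might be once loaded from source control) and shows two small steps: finding where a system has drifted from that desired state, and choosing a small subset of systems to receive a change first.

```python
import math


def find_drift(desired, actual):
    """Return {setting: (desired_value, actual_value)} for each setting that differs.

    A setting missing on one side is reported with None for that side.
    """
    keys = desired.keys() | actual.keys()
    return {
        key: (desired.get(key), actual.get(key))
        for key in sorted(keys)
        if desired.get(key) != actual.get(key)
    }


def canary_subset(hosts, fraction):
    """Pick a deterministic subset of hosts to receive a change first.

    Hosts are sorted by name so the same subset is chosen on every run.
    At least one host is selected whenever the list is non-empty.
    """
    count = max(1, math.ceil(len(hosts) * fraction))
    return sorted(hosts)[:count]


# Hypothetical desired state (as checked into source control) and the
# state observed on one system.
desired_state = {"max_connections": 500, "shared_buffers": "8GB"}
observed_state = {"max_connections": 200, "shared_buffers": "8GB", "debug": True}

drift = find_drift(desired_state, observed_state)
first_wave = canary_subset(["db04", "db01", "db03", "db02"], fraction=0.25)
```

In this sketch, `drift` lists the settings that no longer match what is in source control, and `first_wave` names the systems that would receive a change before the rest, so that their behavior can be compared against the unchanged systems before a wider rollout.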
Testing best practices that software engineers have taken for granted now more easily apply to our infrastructure and configuration. Infrastructure as code can be validated locally on a developer’s workstation using containers or virtual machines. Automated tests can be written and changes can be validated in continuous integration. Breaking changes are more likely to be found as they are committed rather than during release, which increases stability and reduces outages. Through investment in automation and configuration management, we’ve made progress in all of these areas. We are able to deliver more stable, more consistent changes to our customers and are able to reduce the risk in making changes. This in turn allows us to move at the speed of the business customers that we support. Keeping up with the pace of business is an ever-changing challenge for operations. We’re pleased with the progress that our teams have made in this area and look forward to new challenges to be solved in the coming year.