One of the key charters of NetApp Product Tools Analytics Globalization Services (PTAGS) is to enable a service-oriented cloud platform for big-data telemetry across the NetApp product portfolio to derive analytical insights and simplify the customer experience globally.
Unique complexities require innovative solutions
Working in a big data environment involves a unique set of complexities that demand innovative solutions. Our organization, NetApp PTAGS, works in a big data Hadoop environment, producing customer-facing applications. We work across production and sub-production infrastructures as we progress toward delivering an application. A typical infrastructure is a five-layer tiered architecture that consists of:
- The underlying compute and storage resources
- The Cloudera-licensed Hadoop stack
- A software stack for ingesting, processing, loading, and extracting data
- Data storage components (Hadoop and non-Hadoop endpoints)
- And finally, the applications that serve up the data to the customers
Is that complex enough? How do we keep this infrastructure reliable and always on?
Aiming for drama-free releases
The challenge that PTAGS faced was to be able to make more frequent releases while meeting a high-quality bar with optimal use of resources and streamlined processes. To begin with, there was a simple ask from our senior director: “Make releases drama free.”
It was important for us to understand the impediments to achieving drama-free releases. We found that most of the “drama” centered on the deployment window. A typical deployment window involved teams from IT and engineering (development, QA, and automation teams), along with release management. All teams met in 4+ hour sessions for deployment and post-deployment activities. Most of the tasks were done manually, which required tight coordination among the members on the deployment call, with an MC running the show. Mistakes and lags invariably happened with every release.
Solutions step by step
There were many downstream processes that contributed to the deployment challenges. However, it was important for us to begin addressing a few problems immediately and to show marked improvements release over release. We started by looking at the low-hanging fruit:
- What dependencies can be reduced or eliminated?
- What tasks can be automated?
- What tasks can be completed as pre-cutover activities before the deployment window?
- Who owns each task?
- Who needs to be on the call to perform a task?
With heavy reliance on automation, we adopted Jenkins to create automated jobs for deploying release components. We resolved and simplified dependencies. We identified and automated the pre-cutover tasks such as getting the production environment ready with proper permissions (to avoid deployment-time failures); validating that correct builds for deployment are automatically picked up by Jenkins from Artifactory; and completing one-time activities like creating mount points, new directories, and so on that don’t necessarily belong in the deployment window. This automation saved us considerable time during the deployment window and avoided errors.
In addition to using Jenkins for automated deployments, creating and documenting checklists to be completed before and after deployment were key to achieving success.
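To illustrate how a pre-deployment checklist can be expressed as code rather than as a document, here is a minimal Python sketch. It is not our actual Jenkins configuration. The names Check, run_checklist, ready_for_cutover, directory_is_writable, and artifact_exists are hypothetical, and the temporary directory and the file name app-1.0.jar stand in for a real deployment target and build artifact.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    """One checklist item: a name plus a function that returns True on success."""
    name: str
    run: Callable[[], bool]


def run_checklist(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check and record its outcome; an exception counts as a failure."""
    results = []
    for check in checks:
        try:
            ok = bool(check.run())
        except Exception:
            ok = False
        results.append((check.name, ok))
    return results


def ready_for_cutover(results: List[Tuple[str, bool]]) -> bool:
    """The deployment window should open only if every item passed."""
    return all(ok for _, ok in results)


def directory_is_writable(path: str) -> bool:
    """Permission check: the target directory exists and is writable."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def artifact_exists(path: str) -> bool:
    """Build check: the expected artifact file is present."""
    return os.path.isfile(path)


def main() -> None:
    # Hypothetical stand-ins for a real target directory and build artifact.
    with tempfile.TemporaryDirectory() as deploy_dir:
        artifact = os.path.join(deploy_dir, "app-1.0.jar")
        open(artifact, "w").close()

        checks = [
            Check("target directory writable", lambda: directory_is_writable(deploy_dir)),
            Check("build artifact present", lambda: artifact_exists(artifact)),
        ]
        results = run_checklist(checks)
        for name, ok in results:
            print(f"{'PASS' if ok else 'FAIL'}  {name}")
        print("ready for cutover:", ready_for_cutover(results))


if __name__ == "__main__":
    main()
```

Running all checks, instead of stopping at the first failure, gives the team the full list of problems to fix before the deployment window opens.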
It’s easy to see from Figures 1 and 2 how the progression from manually intensive processes to completely automated processes helped resolve issues and increase efficiency.
We simply converted complicated and tightly synchronized manual tasks into workflow-based, automatically executed Jenkins jobs. This conversion led to what we started calling “one-click deployments,” meaning that the click of a single button in Jenkins triggers the deployment and validation workflow. One-click deployment reduces the resources needed for deployment, avoids human error, achieves 80% gains in efficiency, and makes the release truly drama free.
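The idea behind a one-click workflow can be sketched in a few lines of Python: a single entry point runs an ordered list of steps and stops at the first failure. This is only an illustration under assumed names, not how Jenkins is implemented or how our jobs are defined. The function run_workflow and the step names are hypothetical, and the lambdas are placeholders for real deployment and validation actions.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_workflow(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (succeeded, names of the steps that completed successfully).
    An exception raised by a step is treated as a failure of that step.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception:
            ok = False
        if not ok:
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    # Hypothetical steps; each lambda stands in for a real job.
    steps: List[Step] = [
        ("fetch build", lambda: True),
        ("deploy components", lambda: True),
        ("validate deployment", lambda: True),
    ]
    succeeded, completed = run_workflow(steps)
    print("succeeded:", succeeded)
    print("completed steps:", completed)
```

Because later steps never run after a failure, validation cannot report success on a deployment that did not finish.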
The future—more frequent releases
Our senior director is now asking us, “When can I get daily releases?” The question in and of itself shows how far we have progressed from where we started.
The organization today is moving toward the next steps: achieving true agility and addressing the need for frequent releases. Being agile means being able to ship an entity as soon as it’s ready. This simple definition hides the complexity of transitioning into that state. It’s now time to look under the hood, to plan for and address each task that will take us closer to achieving this agility.
Automation and adoption of the DevOps process are at the core of being agile. Here are some key points to consider:
- Automation should be part of the product delivery commitment.
- Automation collateral is software, and it should be treated on par with product code.
It’s important to avoid scenarios in which, to meet product delivery timelines, we short-change automation and begin to create technical debt that becomes insurmountable over time. We are looking closely at all the downstream challenges that we need to address for dynamic release readiness.
More frequent releases also require:
- Uptime and reliability. Our organization today is also focused on defining and strengthening the architectural constructs that can make our applications always reliable and available through autodetection and auto-recovery mechanisms. For example, moving from a traditional batch processing architecture (based on the MapReduce framework) to more fault-tolerant microservices (built on technologies such as Kafka and Spark) makes the architecture easier to monitor and recover automatically.
- Infrastructure readiness and pipeline progression are also key to achieving our goal of being agile. Development, QA, and STAGE infrastructures must be always “on” and “ready” so that the dev code can be built, deployed, and tested continuously. The dev code needs to move seamlessly from the dev infrastructure through progressively stricter automated validations, with decisions based on pass percentages (from developer-centric smoke tests, to continuous integration tests, to system integration tests, to user acceptance tests and load/performance tests), until it is finally deployed on production. We are focused on creating frameworks that make it easy to automate, creating dashboards for visibility, and creating services that can be used generally across applications and purposes. For example, a data service framework would dynamically make the required “data” present on an infrastructure of interest for the tests to access. One of our big challenges today is the availability of an on-demand big data infrastructure. Being able to spin up dynamic infrastructures in AWS with specified services for developers will facilitate R&D activities, thus meeting the challenge.
- Tooling and process. Our organization is adopting the Atlassian toolset to strengthen and streamline the continuous process of software delivery. For example, we are currently focusing on adoption of Bitbucket, JIRA, and BISECT (through the Atlassian Stack).
- Continuous improvement. We are creating feedback loops to discuss issues detected during the dev process, both leading up to and after deployment. Conducting thorough root cause analysis and incorporating the feedback into the downstream processes is important in closing the gaps. Our release management and IT teams conduct a touchpoint meeting to discuss post-production issues. We are strengthening the feedback mechanisms so that any gaps identified that are related to dev, QA, automation, infrastructure, or any other area can be addressed and avoided in the future.
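The pipeline progression described in the list above, in which code is promoted based on pass percentages, can be sketched as a simple gate function. This is a minimal illustration, not our implementation. StageResult and first_blocking_stage are hypothetical names, and the thresholds and test counts are made-up example values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageResult:
    """Test outcome for one pipeline stage."""
    stage: str
    passed: int
    total: int
    threshold: float  # minimum pass percentage required to promote

    @property
    def pass_percentage(self) -> float:
        # A stage with no tests is treated as 0% so that it cannot promote by accident.
        return 100.0 * self.passed / self.total if self.total else 0.0


def first_blocking_stage(results: List[StageResult]) -> Optional[str]:
    """Return the first stage whose pass percentage is below its threshold.

    Returns None when every stage meets its threshold, meaning the build
    may be promoted all the way through.
    """
    for result in results:
        if result.pass_percentage < result.threshold:
            return result.stage
    return None


if __name__ == "__main__":
    # Made-up example values for illustration only.
    results = [
        StageResult("smoke", passed=50, total=50, threshold=100.0),
        StageResult("continuous integration", passed=190, total=200, threshold=98.0),
        StageResult("user acceptance", passed=10, total=10, threshold=95.0),
    ]
    blocker = first_blocking_stage(results)
    print("blocked at:", blocker if blocker else "nothing, promote to production")
```

Keeping each threshold next to its stage makes it easy to tighten the gates as the code moves closer to production.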
Ultimately, we want to move from one-click deployments to no-click deployments. In other words, the concept of a “release” will wither away, and entities will be deployed as and when they are ready. Are we READY for this? Our exec management has made agility a priority, the teams are highly motivated, and yes, absolutely, we are READY. Now it’s just a matter of completing the journey successfully!