Companies all across the world are adjusting to new work-from-home policies and are taking precautions to limit the impact of disease on the lives of employees and customers. The virus has created a ripple effect, impacting everything from a visit to the local grocery store to countless conference cancellations. And the world became aware of this crisis only a little over a month ago. Tech companies have responded by asking, and even requiring, employees to work remotely. The CEO of Zoom Video Communications stated publicly that usage is at an all-time high, most likely due to restrictions on travel.
For companies that deliver applications that enable remote work and collaboration, this has obvious implications. To absorb a sudden and potentially sustained burst of utilization, these companies need a business continuity plan. Their executives must be asking:
- How do we keep our employees safe and productive?
- How do we continue to meet SLAs as usage increases?
- What is our capacity planning strategy?
- For incidents that do occur, are we adequately prepared to address them?
- As the utilization of services increases, what is the impact on margins?
Indeed, these questions should be top of mind for companies in the remote workspace, but companies that now have more employees working remotely on in-house applications face similar challenges.
These are questions we’re thinking about here at Splunk, where we treat data as the fuel that helps us make better decisions.
From a technical operations perspective, we’ve identified four areas where companies can find these answers:
- Measure what matters
- Drive standardization of tools
- Employ an effective escalation policy
- Make learning a part of the process
Measure What Matters
Access to accurate, discoverable, and timely data is what drives collaborative planning and response. Even in the era of the cloud, resources are not limitless. It is critical to develop a deep understanding of infrastructure utilization and of how application changes over time have affected performance and reliability, particularly when capacity planning. However, baseline analysis alone doesn’t adequately safeguard against future incidents. An effective metrics system will be capable of firing an alert within seconds, keeping both mean time to detect (MTTD) and mean time to acknowledge (MTTA) low.
Distributed tracing has become the go-to debugging approach for more complex application architectures, where multiple services are called to fulfill individual requests. Beyond identifying causality during incidents, it can also help technical teams understand the overall impact on application performance: aggregating the metadata contained within the traces produces tag-specific service level indicators (SLIs).
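To make the idea of tag-specific SLIs concrete, here is a minimal sketch in Python. It assumes spans have already been exported as plain dictionaries; the field names (`tags`, `duration_ms`, `error`), the `region` tag, and the 500 ms latency target are all hypothetical choices for illustration, not the schema of any particular tracing product.

```python
from collections import defaultdict


def sli_by_tag(spans, tag, latency_target_ms):
    """Group spans by the value of one tag and compute simple SLIs per group.

    Assumes each span is a dict with hypothetical keys:
    'tags' (dict), 'duration_ms' (number), and 'error' (bool).
    """
    groups = defaultdict(list)
    for span in spans:
        # Spans missing the tag are collected under "unknown".
        groups[span["tags"].get(tag, "unknown")].append(span)

    result = {}
    for value, group in groups.items():
        total = len(group)
        succeeded = sum(1 for s in group if not s["error"])
        fast = sum(1 for s in group if s["duration_ms"] <= latency_target_ms)
        result[value] = {
            "requests": total,
            # Share of requests that completed without an error.
            "availability": succeeded / total,
            # Share of requests that met the latency target.
            "latency_sli": fast / total,
        }
    return result


# Illustrative data only; not taken from a real system.
sample_spans = [
    {"tags": {"region": "us"}, "duration_ms": 120, "error": False},
    {"tags": {"region": "us"}, "duration_ms": 480, "error": False},
    {"tags": {"region": "us"}, "duration_ms": 950, "error": True},
    {"tags": {"region": "eu"}, "duration_ms": 200, "error": False},
]

slis = sli_by_tag(sample_spans, tag="region", latency_target_ms=500)
for region, values in sorted(slis.items()):
    print(region, values)
```

Breaking the same indicators out by a different tag, such as customer tier or application version, only requires changing the `tag` argument, which is what makes trace metadata useful for understanding where an incident is actually being felt.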
Drive Standardization of Tools
Unfamiliarity with the tools and data sets used across teams is a major obstacle to responsiveness and cross-team collaboration. It is not uncommon for two teams to produce different metrics from the same data sets. The more tools in use, the more likely a team is to encounter data that is, or appears to be, inconsistent. Time is then spent debating dashboard and data validity rather than on capacity planning and updating runbooks. When something does go wrong, the last thing the incident manager wants to run into is a set of conflicting tools and dashboards. As open-source data collection grows in popularity, and IT operations companies broaden their offerings, there are more options than ever to collapse the observability stack.
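As a small illustration of how two teams can report different numbers from the same data set, consider one team charting mean latency while another charts the median. The sketch below uses made-up latency values and Python's standard `statistics` module; the team names and figures are purely hypothetical.

```python
import statistics

# One shared set of request latencies in milliseconds (illustrative values),
# including a single slow outlier.
latencies_ms = [100, 110, 120, 130, 2000]

# Hypothetical "team A" dashboard: mean latency.
team_a_latency = statistics.mean(latencies_ms)

# Hypothetical "team B" dashboard: median latency.
team_b_latency = statistics.median(latencies_ms)

print(f"Team A (mean):   {team_a_latency} ms")
print(f"Team B (median): {team_b_latency} ms")
```

Neither number is wrong, but without an agreed definition the two dashboards appear to contradict each other, which is exactly the debate that standardized tooling is meant to avoid.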