O11y - Data is the Key to Understanding Application Behavior
Monitoring solutions typically involve the collection of logs, traces, and metrics that assist organizations in comprehending the behavior of their applications. These tools process vast quantities of data to provide valuable insights. Traditionally, monitoring was conducted in silos, necessitating separate tools for monitoring various aspects, such as applications, networks, security, and more.
Observability (O11y) platforms were introduced to dismantle these silos. They offer a unified and integrated solution that harnesses data from multiple domains, providing comprehensive insights into customer and employee experiences, as well as application and network performance.
What Constitutes a Poor Application or Experience?
We value the convenience of obtaining what we require within a few clicks. Delivering such an experience entails significant risk for vendors. Today’s applications have undergone substantial transformations over the years, resulting in increased complexity. Customers exhibit minimal patience for slow applications, errors, or crashes. An application or SaaS service can be replaced as swiftly as it was downloaded or acquired.
Organizations employ monitoring tools to ensure that their customers have a satisfactory experience. During the pandemic, organizations recognized the paramount importance of providing a positive experience for their employees, regardless of whether they were working from home or in an office. With the extensive range of tools available today, organizations can monitor on the server side or the client side, in real time or synthetically. Business and technology professionals have leveraged these tools to identify issues that affect experience and profitability before end users become aware of them.
Data: The Driving Force Behind Cloud Computing
Cloud computing has transformed what used to be a landscape of monolithic systems, and complexity has surged. What was once a modest footprint emitting a modest amount of telemetry has become vast infrastructure spread across multiple cloud providers and geographies to meet ever-expanding demand, generating millions of metrics every minute.
In the early days of monitoring, we prioritized collecting as much telemetry data as possible. Today, we face a new challenge: selecting the appropriate metrics, determining the correct cardinality, and setting the optimal retention period. Insufficient data collection hinders trend identification and root cause analysis, while excessive telemetry brings high costs, scalability issues, and poor insights.
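To make the cardinality and retention trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case, in which every combination of label values occurs, so the number of time series for a metric is the product of the distinct values of each label. The label names, counts, scrape interval, and retention period are hypothetical, chosen only for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric:
    the product of the distinct values of each label."""
    return prod(label_cardinalities.values())


def stored_samples(series: int, scrape_interval_s: int, retention_days: int) -> int:
    """Samples kept for the whole retention period at a fixed scrape interval."""
    samples_per_series = (retention_days * 86_400) // scrape_interval_s
    return series * samples_per_series


# Hypothetical label sets for a single request-latency metric.
lean = {"service": 20, "region": 4, "status_class": 5}
with_user_id = {**lean, "user_id": 10_000}  # one unbounded label added

print(series_count(lean))          # 400
print(series_count(with_user_id))  # 4000000
print(stored_samples(series_count(lean), scrape_interval_s=60, retention_days=30))  # 17280000
```

Under these assumptions, adding a single high-cardinality label multiplies the series count by ten thousand, which is why cardinality and retention deserve a deliberate decision rather than a default.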
Determining the appropriate data collection strategy is crucial. A common rule of thumb in observability (O11y) is to collect everything, on the assumption that more data yields the insights companies require. However, data quality is often compromised, and despite our confidence in AI, data without enrichment lacks value, regardless of the volume collected. Platforms that prioritize data collection may inadvertently engage in data hoarding, passing on the associated costs to the end-user.
When selecting metrics and determining their appropriate measurement, it is essential to consider the specific requirements of the application. The primary objective should be to observe and measure metrics that directly impact revenue generation. Therefore, it is crucial to identify the fundamental purpose of an application and establish its importance. Once the application’s significance is determined, it is necessary to define the acceptable tolerance level for its performance, availability, and reliability (e.g., SLO, SLA). Armed with these parameters, organizations can select the most relevant metrics that align with their use case and enhance the value of their applications for their customers. This approach avoids the pitfalls of excessive data collection and relies on AI to interpret the collected data effectively.
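As an illustration of turning a tolerance level into numbers, the sketch below converts an availability SLO into an error budget. The function names, the 30-day window, and the request counts are assumptions made for this example and are not taken from any particular platform.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on request counts.
    A negative result means the SLO has already been violated."""
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical service with a 99.9% availability SLO over 30 days.
print(round(error_budget_minutes(0.999), 1))                    # 43.2
print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))    # 0.5
```

With a target like this in hand, the metrics worth collecting are the ones needed to compute the good and total event counts, plus whatever helps explain why the budget is being spent.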
Even as a Product Manager, it’s important to build this flexibility into your product by letting users choose which telemetry data (or groups of telemetry data) should be collected. Users can then choose what is important to them and pay only for the data that is collected.
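One way such user-selectable telemetry could look is sketched below. The group names, the metric names, and the shape of a sample (a dictionary with a "name" key) are hypothetical. A real product would define its own catalog and apply the filter in its collection agent or pipeline.

```python
# Hypothetical catalog: each group bundles the metrics a user can opt into.
TELEMETRY_GROUPS = {
    "availability": {"http_requests_total", "http_errors_total"},
    "latency": {"http_request_duration_seconds"},
    "runtime": {"gc_pause_seconds", "heap_bytes", "thread_count"},
}


def enabled_metrics(selected_groups: list[str]) -> set[str]:
    """Union of metric names for the groups the user has switched on."""
    unknown = set(selected_groups) - set(TELEMETRY_GROUPS)
    if unknown:
        raise ValueError(f"unknown telemetry groups: {sorted(unknown)}")
    names: set[str] = set()
    for group in selected_groups:
        names |= TELEMETRY_GROUPS[group]
    return names


def filter_samples(samples: list[dict], selected_groups: list[str]) -> list[dict]:
    """Keep only samples whose metric belongs to an enabled group."""
    allowed = enabled_metrics(selected_groups)
    return [s for s in samples if s["name"] in allowed]


samples = [
    {"name": "http_requests_total", "value": 1200},
    {"name": "heap_bytes", "value": 512_000_000},
    {"name": "http_request_duration_seconds", "value": 0.21},
]
kept = filter_samples(samples, ["availability", "latency"])
print([s["name"] for s in kept])
# ['http_requests_total', 'http_request_duration_seconds']
```

Dropping unselected metrics at collection time, rather than after ingestion, is what lets the bill follow the user's choices.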
Which Metrics Should I Prioritize?
When developing a comprehensive scope, it is generally advisable to adopt a top-down approach. This strategy ensures that metrics are selected based on established best practices rather than an ad hoc approach. It is crucial to consider the context in which metrics will be used, as committing them to an observability pipeline without a thorough understanding of their requirements can lead to an overwhelming influx of data and potentially inaccurate alerts.
False alerts almost always lead to a loss of credibility for the platform, or to the introduction of yet another tool (like Moogsoft) to validate whether an alert is a false positive.
A Comprehensive O11y Strategy for Mitigating Risks and Enhancing Data Management
A mature O11y strategy encompasses the development of a standardized metrics package tailored to the most commonly used products and services. This approach serves as a proactive measure to reduce the likelihood of metrics FOMO (Fear Of Missing Out) and false positives.
Future-Proofing O11y: Navigating the Data Deluge
The exponential growth of data presents an ongoing challenge both for organizations seeking to maintain effective O11y and for vendors providing O11y platforms. However, this challenge is also an opportunity to collect data strategically, prioritize meaningful metrics over exhaustive data collection, and establish robust logging standards.
Additionally, O11y vendors must prioritize advanced capabilities for managing high-cardinality data and controlling costs. This enables customers to optimize their O11y strategy and extract valuable insights without being overwhelmed by data. In a dynamic growth environment, effectively managing data volume is crucial for ensuring exceptional customer and user experiences, driving innovation, and achieving sustainable success.