Objective

Ensure the availability and performance of the Acme Web Application, a critical service for business operations, by identifying dependencies, pinpointing root causes of downtime, and using OpsRamp Service Maps to streamline resolution.

Scenario

During a high-traffic period, the Acme Web Application goes down unexpectedly. Users report connectivity issues, and monitoring tools generate multiple alerts. The application comprises:

  • A Load Balancer to manage incoming traffic.
  • Multiple Application Servers to process requests.
  • Database Servers for data storage and management.

The downtime causes significant business disruption, and quick resolution is paramount.

Challenges

  • Is the web application available to users?
  • Which component is causing the issue if the service is unavailable?
  • What percentage of time was the application available?
  • How much downtime is attributable to each component?
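
To make the last two questions concrete, the following is a minimal Python sketch of how availability and per-component downtime could be computed from outage records. It does not use any OpsRamp API. The observation window, the outage timestamps, and the function names are all hypothetical, and it assumes the service counts as down whenever at least one component is down.

```python
from datetime import datetime, timedelta

# Hypothetical 24-hour observation window (illustrative values only).
WINDOW_START = datetime(2024, 1, 1, 0, 0)
WINDOW_END = datetime(2024, 1, 2, 0, 0)

# Hypothetical outage records: (component, outage start, outage end).
OUTAGES = [
    ("Cisco NXOS Switch", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 45)),
    ("Apache Web Server", datetime(2024, 1, 1, 10, 30), datetime(2024, 1, 1, 11, 0)),
    ("Cisco NXOS Switch", datetime(2024, 1, 1, 18, 0), datetime(2024, 1, 1, 18, 15)),
]


def downtime_by_component(outages):
    """Sum outage durations per component."""
    totals = {}
    for component, start, end in outages:
        totals[component] = totals.get(component, timedelta()) + (end - start)
    return totals


def service_downtime(outages):
    """Total time at least one component was down.

    Overlapping outages are merged so that they are counted only once.
    """
    total = timedelta()
    current_start = current_end = None
    for _, start, end in sorted(outages, key=lambda o: o[1]):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def availability_percent(outages, window_start, window_end):
    """Percentage of the window during which no component was down."""
    window = window_end - window_start
    return 100.0 * (1 - service_downtime(outages) / window)


if __name__ == "__main__":
    for component, total in downtime_by_component(OUTAGES).items():
        print(f"{component}: {total}")
    print(f"Availability: {availability_percent(OUTAGES, WINDOW_START, WINDOW_END):.2f}%")
```

Because overlapping outages are merged, the per-component totals can add up to more than the service downtime. This is why the two questions are answered separately.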

Solution

OpsRamp Service Maps provide an intuitive way to address these challenges. Here’s how you can build and use a Service Map to monitor and resolve the issue effectively:

  1. Create the Service Map
    1. Configure the main Service node, Acme Web Application, to represent the entire service.
    2. Configure its availability thresholds and enable alerts to track downtime.
    3. Configure Resource nodes. In this use case:
      • Apache Web Server groups multiple resources, so add it as a Service node.
      • Cisco NXOS Switch is a networking component. Add it as a Resource node.
      • Storage Service Map is an existing map. Link it to the current Service Map so that its updates are reflected dynamically.
      • Database Service Map is also a separately created map. Import it into the current Service Map as a static copy.
      You can view all the existing Service Maps on the Service Maps listing page and filter them based on the type and availability.
  2. Monitor in Real-Time
    • Green nodes: Indicate that components are operational.
    • Red nodes: Highlight failures, such as the Cisco NXOS Switch failure that brings the service down.
    • Orange nodes: Indicate degraded performance caused by underlying nodes.
  3. Drill Down for Root Cause
    1. Open the Resource Node Details panel for the Cisco NXOS Switch:
      • Review Alerts to confirm it as the root cause.
      • Assess downstream impact on connected nodes.
      If you want to see the details of the alerts or incidents for the entire Service Map, select your map from the Service Maps listing page and view these details in the slide-out.
  4. Take Corrective Actions
    • Redistribute traffic to operational components.
    • Edit threshold levels, if required.
    • Allocate additional resources to handle the load.
    • Resolve the failing switch issue and validate recovery.
    The Service Map updates dynamically and shows all nodes back in green status once the issue is resolved.
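
The color roll-up and root-cause drill-down described in the steps above can be sketched in Python as follows. This is an illustrative model only, not OpsRamp's actual status logic or API. The dependency layout in `DEPENDS_ON`, the `DOWN` set, and both function names are hypothetical. The roll-up rule is a simplification: a failed node is red, and a healthy node with any non-green dependency is orange. The sketch also assumes the dependency graph has no cycles.

```python
# Hypothetical dependency layout: each node lists the nodes it depends on.
DEPENDS_ON = {
    "Acme Web Application": [
        "Apache Web Server",
        "Storage Service Map",
        "Database Service Map",
    ],
    "Apache Web Server": ["Cisco NXOS Switch"],
    "Cisco NXOS Switch": [],
    "Storage Service Map": [],
    "Database Service Map": [],
}

# Hypothetical set of nodes that are currently failing on their own.
DOWN = {"Cisco NXOS Switch"}


def node_color(node, depends_on, down):
    """Return 'red', 'orange', or 'green' for a node (simplified rule)."""
    if node in down:
        return "red"
    children = depends_on.get(node, [])
    if any(node_color(child, depends_on, down) != "green" for child in children):
        return "orange"
    return "green"


def root_causes(node, depends_on, down):
    """Return the deepest failed nodes at or beneath `node`.

    A failed node is reported only if none of its own dependencies
    already explains the failure.
    """
    found = []
    for child in depends_on.get(node, []):
        found.extend(root_causes(child, depends_on, down))
    if not found and node in down:
        found.append(node)
    return found


if __name__ == "__main__":
    for name in DEPENDS_ON:
        print(f"{name}: {node_color(name, DEPENDS_ON, DOWN)}")
    print("Root cause:", root_causes("Acme Web Application", DEPENDS_ON, DOWN))
```

With an empty `DOWN` set, every node evaluates to green, which corresponds to the recovered state at the end of step 4.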

Outcome

Using Service Maps, your team:

  • Identifies the failing Cisco NXOS Switch as the root cause within minutes.
  • Implements corrective actions, restoring service quickly.
  • Gains actionable insights for future optimization and scalability.