Incident Management

Effective incident management is crucial for maintaining user trust during service disruptions. StatusPageOne provides comprehensive tools for both manual incident management and automated incident detection.

Understanding Incidents vs Maintenance

Incidents

Unexpected service disruptions that impact users:

Service outages or failures
Performance degradations
Security issues affecting availability
Database connectivity problems
Third-party integration failures

Maintenance

Planned activities that may affect service availability:

Scheduled system updates
Database maintenance windows
Infrastructure migrations
Feature deployments
Security patches

Incident Lifecycle

1. Detection and Creation

Manual creation: Admin creates incident manually
Automated detection: System automatically creates incidents based on monitor failures
User reports: Information gathered from support tickets or user feedback

2. Investigation and Updates

Initial response: Acknowledge the incident and begin investigation
Regular updates: Provide status updates at regular intervals
Progress communication: Share findings and resolution steps

3. Resolution and Follow-up

Service restoration: Fix the underlying issue
Verification: Confirm services are fully operational
Post-incident review: Analyze cause and prevention measures

Manual Incident Creation

Creating an Incident

Navigate to Incidents
- Go to Status Pages > Incidents
- Click Create Incident button

Incidents Dashboard

Incident Details
- Title: Clear, descriptive title (e.g., "API Response Delays")
- Description: Detailed explanation of the issue and impact
- Severity: Choose from Minor, Major, or Critical

Create Incident Form

Affected Services
- Select which monitors/services are affected
- Multiple services can be associated with one incident
Initial Status
- Investigating: Currently diagnosing the issue
- Identified: Root cause has been found
- Monitoring: Fix applied, monitoring for stability
- Resolved: Issue completely resolved

Incident Severity Levels

Minor Incidents

Limited impact on users
Non-critical features affected
Workarounds available
Example: Secondary API endpoint slow response

Major Incidents

Significant impact on core functionality
Many users affected
Limited workarounds
Example: Database performance issues affecting all users

Critical Incidents

Complete service unavailability
All users affected
No workarounds available
Example: Complete system outage

Incident Severity Levels

Best Practices for Manual Incidents

Title Guidelines

Be specific: "Payment Processing Delays" vs "System Issues"
Avoid technical jargon: Use user-friendly language
Keep it concise: 50 characters or less when possible
Include impact: What users are experiencing

Description Writing

Start with impact: What users are experiencing
Provide context: When the issue started
Avoid blame: Focus on resolution, not fault
Set expectations: Estimated resolution timeframe if known

Automated Incident Generation

Configuration

Set up automatic incident creation for each monitor:

Navigate to Monitor Settings
- Go to Monitors > Select monitor > Settings
- Find Incident Configuration section
Enable Auto-Generation
- Toggle Auto-generate incidents to ON
- Configure failure thresholds and parameters

Auto-Generation Parameters

Failure Threshold

Default: 3 consecutive failures
Range: 1-10 consecutive failures
Recommendation: 3-5 failures to avoid false positives
Consideration: Balance between quick detection and false alarms

Failure Window

Default: 5 minutes
Range: 1-30 minutes
Purpose: Time window to evaluate consecutive failures
Example: 3 failures within 5 minutes triggers incident

Default Severity

Minor: Non-critical services or internal tools
Major: Important user-facing services (default)
Critical: Essential services that block user workflows

Incident Grouping

Related monitor failures are automatically grouped:

Grouping Logic

Time-based: Failures occurring within 10 minutes
Status page: Only monitors from the same status page
Related services: Services likely to fail together

Benefits

Reduced noise: One incident instead of multiple
Better communication: Consolidated updates
Easier management: Single resolution workflow

Incident Updates and Communication

Status Update Types

Investigating

When to use: Initial incident detection
Message focus: Acknowledging the issue and beginning investigation
Example: "We're investigating reports of slow API responses affecting user logins."

Identified

When to use: Root cause has been determined
Message focus: Explaining what caused the issue
Example: "We've identified a database performance issue causing login delays. Our team is implementing a fix."

Monitoring

When to use: Fix has been applied, monitoring for stability
Message focus: Solution deployed, watching for full recovery
Example: "The database issue has been resolved. We're monitoring the system to ensure full stability."

Resolved

When to use: Issue is completely fixed and stable
Message focus: Confirmation of resolution and any follow-up actions
Example: "All systems are now fully operational. Login performance has returned to normal."

Update Best Practices

Frequency Guidelines

Every 30 minutes for critical incidents
Every hour for major incidents
Every 2-4 hours for minor incidents
More frequently during active resolution work

Communication Principles

Be honest: Don't hide information or downplay issues
Be clear: Use simple language everyone can understand
Be timely: Don't wait for perfect information
Be consistent: Match your tone across all updates

Update Template

**Status**: [Current status]
**Update**: [What's happening now]
**Impact**: [How users are affected]
**Next Steps**: [What we're doing next]
**ETA**: [Expected resolution time, if known]

Maintenance Management

Scheduling Maintenance

Create Maintenance Event
- Go to Status Pages > Maintenance
- Click Schedule Maintenance

Maintenance Dashboard

Maintenance Details
- Title: Clear description of the work
- Description: What's being updated and expected impact
- Affected Services: Which services may be impacted
Schedule Settings
- Start Time: When maintenance begins
- End Time: Expected completion time
- Time Zone: Displayed in user's local time zone

Maintenance Communication

Advance Notice

1 week minimum for major maintenance
72 hours minimum for minor maintenance
24 hours minimum for emergency maintenance
Multiple reminders leading up to the event

During Maintenance

Start notification: When maintenance begins
Progress updates: If maintenance runs longer than expected
Completion notification: When maintenance is finished

Maintenance Best Practices

Timing Considerations

Off-peak hours: Schedule during lowest usage periods
Time zones: Consider your global user base
Avoid holidays: Don't schedule during major holidays
Business impact: Consider business-critical times

Communication Strategy

Multi-channel: Status page, email, in-app notifications
Clear impact: Explain exactly what users will experience
Backup plans: Have rollback procedures ready
Contact information: Provide escalation contacts if needed

Subscriber Notifications

Automatic Notifications

When incidents or maintenance are created/updated:

Email subscribers receive immediate notifications
Slack integrations post to configured channels
Notifications respect user subscription preferences

Notification Content

Email Notifications

Subject line: Includes status page name and incident title
Body content: Status update with timestamp and next steps
Unsubscribe link: Easy opt-out for users
Branding: Includes your logo and styling

Slack Notifications

Rich formatting: Uses Slack's message formatting
Status colors: Color-coded based on severity
Action buttons: Links to full status page
Thread organization: Updates post in threads for context

Post-Incident Procedures

Incident Resolution

Resolution Checklist

Verify fix: Confirm the underlying issue is resolved
Monitor stability: Watch for recurring problems
Update status: Change incident status to "Resolved"
Final communication: Notify users of complete resolution
Document lessons: Record what was learned

Post-Mortem Process

When to Conduct

All major incidents: Complete analysis required
Critical incidents: Detailed post-mortem with timeline
Recurring issues: Pattern analysis and prevention
Customer impact: Significant user-facing problems

Post-Mortem Elements

Timeline: Detailed sequence of events
Root cause: Technical explanation of the failure
Impact assessment: User and business impact
Resolution steps: What was done to fix the issue
Prevention measures: Steps to prevent recurrence

Integration with Monitoring

Monitor-to-Incident Flow

Monitor Failure → Threshold Check → Auto-Incident Creation → Notification Sending → Status Page Update

Configuration Tips

Threshold Tuning

Start conservative: Use higher thresholds initially
Monitor false positives: Adjust based on experience
Service criticality: More sensitive for critical services
Historical data: Use past incidents to calibrate

Incident Routing

Service grouping: Related services share incident configurations
Escalation paths: Define who gets notified for different severities
Time-based rules: Different thresholds for business vs off hours

Metrics and Analysis

Incident Tracking

StatusPageOne automatically tracks:

Incident frequency: How often incidents occur
Resolution times: MTTR (Mean Time To Resolution)
Severity distribution: Breakdown of incident types
User impact: Affected user estimates

Performance Metrics

Key Indicators

Uptime percentage: Overall service availability
Incident count: Total incidents per time period
Average resolution time: Speed of incident resolution
User satisfaction: Subscriber feedback and engagement

Reporting

Weekly summaries: Incident and maintenance reports
Monthly trends: Long-term availability analysis
Annual reviews: Year-over-year reliability improvements
Custom exports: Data for internal analysis

Effective incident management builds user trust through transparent, timely communication during service disruptions while providing the tools needed for quick resolution and continuous improvement.

Incident Management

Understanding Incidents vs Maintenance

Incidents

Unexpected service disruptions that impact users:

Service outages or failures
Performance degradations
Security issues affecting availability
Database connectivity problems
Third-party integration failures

Maintenance

Planned activities that may affect service availability:

Scheduled system updates
Database maintenance windows
Infrastructure migrations
Feature deployments
Security patches

Incident Lifecycle

1. Detection and Creation

Manual creation: Admin creates incident manually
Automated detection: System automatically creates incidents based on monitor failures
User reports: Information gathered from support tickets or user feedback

2. Investigation and Updates

Initial response: Acknowledge the incident and begin investigation
Regular updates: Provide status updates at regular intervals
Progress communication: Share findings and resolution steps

3. Resolution and Follow-up

Service restoration: Fix the underlying issue
Verification: Confirm services are fully operational
Post-incident review: Analyze cause and prevention measures

Manual Incident Creation

Creating an Incident

Navigate to Incidents
- Go to Status Pages > Incidents
- Click Create Incident button

Incidents Dashboard

Incident Details
- Title: Clear, descriptive title (e.g., "API Response Delays")
- Description: Detailed explanation of the issue and impact
- Severity: Choose from Minor, Major, or Critical

Create Incident Form

Affected Services
- Select which monitors/services are affected
- Multiple services can be associated with one incident
Initial Status
- Investigating: Currently diagnosing the issue
- Identified: Root cause has been found
- Monitoring: Fix applied, monitoring for stability
- Resolved: Issue completely resolved

Incident Severity Levels

Minor Incidents

Limited impact on users
Non-critical features affected
Workarounds available
Example: Secondary API endpoint slow response

Major Incidents

Significant impact on core functionality
Many users affected
Limited workarounds
Example: Database performance issues affecting all users

Critical Incidents

Complete service unavailability
All users affected
No workarounds available
Example: Complete system outage

Incident Severity Levels

Best Practices for Manual Incidents

Title Guidelines

Be specific: "Payment Processing Delays" vs "System Issues"
Avoid technical jargon: Use user-friendly language
Keep it concise: 50 characters or less when possible
Include impact: What users are experiencing

Description Writing

Start with impact: What users are experiencing
Provide context: When the issue started
Avoid blame: Focus on resolution, not fault
Set expectations: Estimated resolution timeframe if known

Automated Incident Generation

Configuration

Set up automatic incident creation for each monitor:

Navigate to Monitor Settings
- Go to Monitors > Select monitor > Settings
- Find Incident Configuration section
Enable Auto-Generation
- Toggle Auto-generate incidents to ON
- Configure failure thresholds and parameters

Auto-Generation Parameters

Failure Threshold

Default: 3 consecutive failures
Range: 1-10 consecutive failures
Recommendation: 3-5 failures to avoid false positives
Consideration: Balance between quick detection and false alarms

Failure Window

Default: 5 minutes
Range: 1-30 minutes
Purpose: Time window to evaluate consecutive failures
Example: 3 failures within 5 minutes triggers incident

Default Severity

Minor: Non-critical services or internal tools
Major: Important user-facing services (default)
Critical: Essential services that block user workflows

Incident Grouping

Related monitor failures are automatically grouped:

Grouping Logic

Time-based: Failures occurring within 10 minutes
Status page: Only monitors from the same status page
Related services: Services likely to fail together

Benefits

Reduced noise: One incident instead of multiple
Better communication: Consolidated updates
Easier management: Single resolution workflow

Incident Updates and Communication

Status Update Types

Investigating

When to use: Initial incident detection
Message focus: Acknowledging the issue and beginning investigation
Example: "We're investigating reports of slow API responses affecting user logins."

Identified

When to use: Root cause has been determined
Message focus: Explaining what caused the issue
Example: "We've identified a database performance issue causing login delays. Our team is implementing a fix."

Monitoring

When to use: Fix has been applied, monitoring for stability
Message focus: Solution deployed, watching for full recovery
Example: "The database issue has been resolved. We're monitoring the system to ensure full stability."

Resolved

When to use: Issue is completely fixed and stable
Message focus: Confirmation of resolution and any follow-up actions
Example: "All systems are now fully operational. Login performance has returned to normal."

Update Best Practices

Frequency Guidelines

Every 30 minutes for critical incidents
Every hour for major incidents
Every 2-4 hours for minor incidents
More frequently during active resolution work

Communication Principles

Be honest: Don't hide information or downplay issues
Be clear: Use simple language everyone can understand
Be timely: Don't wait for perfect information
Be consistent: Match your tone across all updates

Update Template

**Status**: [Current status]
**Update**: [What's happening now]
**Impact**: [How users are affected]
**Next Steps**: [What we're doing next]
**ETA**: [Expected resolution time, if known]

Maintenance Management

Scheduling Maintenance

Create Maintenance Event
- Go to Status Pages > Maintenance
- Click Schedule Maintenance

Maintenance Dashboard

Maintenance Details
- Title: Clear description of the work
- Description: What's being updated and expected impact
- Affected Services: Which services may be impacted
Schedule Settings
- Start Time: When maintenance begins
- End Time: Expected completion time
- Time Zone: Displayed in user's local time zone

Maintenance Communication

Advance Notice

1 week minimum for major maintenance
72 hours minimum for minor maintenance
24 hours minimum for emergency maintenance
Multiple reminders leading up to the event

During Maintenance

Start notification: When maintenance begins
Progress updates: If maintenance runs longer than expected
Completion notification: When maintenance is finished

Maintenance Best Practices

Timing Considerations

Off-peak hours: Schedule during lowest usage periods
Time zones: Consider your global user base
Avoid holidays: Don't schedule during major holidays
Business impact: Consider business-critical times

Communication Strategy

Multi-channel: Status page, email, in-app notifications
Clear impact: Explain exactly what users will experience
Backup plans: Have rollback procedures ready
Contact information: Provide escalation contacts if needed

Subscriber Notifications

Automatic Notifications

When incidents or maintenance are created/updated:

Email subscribers receive immediate notifications
Slack integrations post to configured channels
Notifications respect user subscription preferences

Notification Content

Email Notifications

Subject line: Includes status page name and incident title
Body content: Status update with timestamp and next steps
Unsubscribe link: Easy opt-out for users
Branding: Includes your logo and styling

Slack Notifications

Rich formatting: Uses Slack's message formatting
Status colors: Color-coded based on severity
Action buttons: Links to full status page
Thread organization: Updates post in threads for context

Post-Incident Procedures

Incident Resolution

Resolution Checklist

Verify fix: Confirm the underlying issue is resolved
Monitor stability: Watch for recurring problems
Update status: Change incident status to "Resolved"
Final communication: Notify users of complete resolution
Document lessons: Record what was learned

Post-Mortem Process

When to Conduct

All major incidents: Complete analysis required
Critical incidents: Detailed post-mortem with timeline
Recurring issues: Pattern analysis and prevention
Customer impact: Significant user-facing problems

Post-Mortem Elements

Timeline: Detailed sequence of events
Root cause: Technical explanation of the failure
Impact assessment: User and business impact
Resolution steps: What was done to fix the issue
Prevention measures: Steps to prevent recurrence

Integration with Monitoring

Monitor-to-Incident Flow

Monitor Failure → Threshold Check → Auto-Incident Creation → Notification Sending → Status Page Update

Configuration Tips

Threshold Tuning

Start conservative: Use higher thresholds initially
Monitor false positives: Adjust based on experience
Service criticality: More sensitive for critical services
Historical data: Use past incidents to calibrate

Incident Routing

Service grouping: Related services share incident configurations
Escalation paths: Define who gets notified for different severities
Time-based rules: Different thresholds for business vs off hours

Metrics and Analysis

Incident Tracking

StatusPageOne automatically tracks:

Incident frequency: How often incidents occur
Resolution times: MTTR (Mean Time To Resolution)
Severity distribution: Breakdown of incident types
User impact: Affected user estimates

Performance Metrics

Key Indicators

Uptime percentage: Overall service availability
Incident count: Total incidents per time period
Average resolution time: Speed of incident resolution
User satisfaction: Subscriber feedback and engagement

Reporting

Weekly summaries: Incident and maintenance reports
Monthly trends: Long-term availability analysis
Annual reviews: Year-over-year reliability improvements
Custom exports: Data for internal analysis

Effective incident management builds user trust through transparent, timely communication during service disruptions while providing the tools needed for quick resolution and continuous improvement.

Incidents

Incident Management

Understanding Incidents vs Maintenance

Incidents

Maintenance

Incident Lifecycle

1. Detection and Creation

2. Investigation and Updates

3. Resolution and Follow-up

Manual Incident Creation

Creating an Incident

Incident Severity Levels

Minor Incidents

Major Incidents

Critical Incidents

Best Practices for Manual Incidents

Title Guidelines

Description Writing

Automated Incident Generation

Configuration

Auto-Generation Parameters

Failure Threshold

Failure Window

Default Severity

Incident Grouping

Grouping Logic

Benefits

Incident Updates and Communication

Status Update Types

Investigating

Identified

Monitoring

Resolved

Update Best Practices

Frequency Guidelines

Communication Principles

Update Template

Maintenance Management

Scheduling Maintenance

Maintenance Communication

Advance Notice

During Maintenance

Maintenance Best Practices

Timing Considerations

Communication Strategy

Subscriber Notifications

Automatic Notifications

Notification Content

Email Notifications

Slack Notifications

Post-Incident Procedures

Incident Resolution

Resolution Checklist

Post-Mortem Process

When to Conduct

Post-Mortem Elements

Integration with Monitoring

Monitor-to-Incident Flow

Configuration Tips

Threshold Tuning

Incident Routing

Metrics and Analysis

Incident Tracking

Performance Metrics

Key Indicators

Reporting

Improve this page

Incidents

Incident Management

Understanding Incidents vs Maintenance

Incidents

Maintenance

Incident Lifecycle

1. Detection and Creation

2. Investigation and Updates

3. Resolution and Follow-up

Manual Incident Creation

Creating an Incident

Incident Severity Levels

Minor Incidents