gaussian-wellworks/Runbooks/System Admin Runbook.md
2026-05-09 09:36:21 -05:00


Fractional Insight CIO

Operational Runbook — Daily & Weekly Administration

DigitalOcean / Docker / Nextcloud / Fastmail Pilot Environment


1. Purpose

This operational runbook defines the recurring administrative tasks required to safely operate and maintain the client pilot environment hosted on DigitalOcean infrastructure using:

  • Nextcloud
  • Fastmail
  • Docker containers
  • Linux server administration
  • Reverse proxy / SSL management
  • Backup and recovery validation
  • Security and compliance oversight

The goal of this runbook is to:

  • Reduce operational risk
  • Reduce exposure to liability
  • Detect security incidents early
  • Ensure recoverability of client data
  • Maintain stable uptime and user access
  • Establish evidence of reasonable administrative diligence

This document assumes:

  • Remote users
  • No internal IT staff
  • Small pilot deployment
  • Shared responsibility model between consultant and client
  • MFA enforcement in both Fastmail and Nextcloud

2. Operational Philosophy

The environment should be treated as:

  • A business collaboration platform
  • A controlled data environment
  • A security-sensitive system
  • A system requiring documented administrative oversight

Because the platform contains:

  • Client communications
  • Potential confidential documents
  • Shared file repositories
  • User credentials
  • Internet-exposed services

…administration must prioritize:

  1. Security first
  2. Recoverability second
  3. Stability third
  4. Convenience last

3. Daily Operational Tasks


3.1 Morning Health Check

Frequency

Daily (business days)

Estimated Time

10-15 minutes

Objective

Confirm that all core systems are operational before users begin work.

Tasks

Infrastructure

  • Verify droplet online in DigitalOcean
  • Verify CPU/RAM/disk usage within normal thresholds
  • Verify disk utilization below 80%
  • Verify Docker daemon operational
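
The disk threshold can be checked from the shell. The following is a minimal sketch, assuming the root filesystem (/) is the one that matters; the 80% limit comes from the list above, and the function names are illustrative only.

```shell
#!/bin/sh
# Sketch only: the 80% limit comes from the runbook; checking "/" is an assumption.
DISK_LIMIT=80

# Read `df -P` output on stdin and print the use% of the first filesystem as a bare number.
parse_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when the given percentage is below the limit.
disk_ok() {
    [ "$1" -lt "$DISK_LIMIT" ]
}

used=$(df -P / | parse_use_percent)
if disk_ok "$used"; then
    echo "OK: / is at ${used}% (limit ${DISK_LIMIT}%)"
else
    echo "WARN: / is at ${used}% (limit ${DISK_LIMIT}%)"
fi
```

On a systemd host, `systemctl is-active docker` covers the Docker daemon item in the same pass.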

Services

  • Verify Nextcloud web login functional
  • Verify Fastmail operational status
  • Verify SSL certificates valid
  • Verify reverse proxy routing functional

Containers

Check:

  • Nextcloud container
  • Database container
  • Redis container
  • Reverse proxy container

Example:

docker ps
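
Reading `docker ps` output by eye is easy to get wrong, so the comparison against the expected containers can be scripted. This is a sketch under assumptions: apart from nextcloud-app, the container names below are placeholders and must be replaced with the real ones.

```shell
#!/bin/sh
# Container names are placeholders; only "nextcloud-app" appears elsewhere in this runbook.
EXPECTED="nextcloud-app nextcloud-db nextcloud-redis reverse-proxy"

# Read running container names on stdin (one per line) and print every expected name that is absent.
missing_containers() {
    running=$(cat)
    for name in "$@"; do
        printf '%s\n' "$running" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}

if command -v docker >/dev/null 2>&1; then
    # --format '{{.Names}}' limits `docker ps` output to container names.
    missing=$(docker ps --format '{{.Names}}' | missing_containers $EXPECTED)
    if [ -n "$missing" ]; then
        echo "WARN: not running: $missing"
    else
        echo "OK: all expected containers are running"
    fi
fi
```

Because `docker ps` lists only running containers, a stopped or crashed container shows up as missing.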

External Access Test

Validate:

  • HTTPS access
  • File upload/download
  • Login functionality
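
Part of the external test can be automated against Nextcloud's status.php endpoint, which returns a small JSON document. The sketch below assumes a hypothetical NEXTCLOUD_URL variable holding the public address. Login and file upload/download still need a manual check.

```shell
#!/bin/sh
# Return 0 when a status.php JSON body (on stdin) reports an installed instance outside maintenance mode.
status_healthy() {
    body=$(cat)
    printf '%s' "$body" | grep -q '"installed":true' &&
        printf '%s' "$body" | grep -q '"maintenance":false'
}

# NEXTCLOUD_URL is a placeholder, e.g. https://cloud.example.com; nothing runs unless it is set.
if [ -n "${NEXTCLOUD_URL:-}" ]; then
    # -f fails on HTTP errors, -sS hides progress but shows errors, --max-time bounds the wait.
    if curl -fsS --max-time 10 "$NEXTCLOUD_URL/status.php" | status_healthy; then
        echo "OK: Nextcloud answers over HTTPS and is not in maintenance mode"
    else
        echo "WARN: status check failed for $NEXTCLOUD_URL"
    fi
fi
```

curl verifies the TLS certificate by default, so a passing check also indicates that the certificate is currently valid for that host name.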

Email

Send/receive test email through Fastmail test account.

Deliverable

  • Daily operational log entry

3.2 Security Event Review

Frequency

Daily

Estimated Time

10 minutes

Objective

Identify suspicious activity before it escalates into an incident.

Tasks

Review:

  • Failed login attempts
  • MFA failures
  • New device logins
  • Suspicious IP addresses
  • Excessive upload activity
  • Unexpected admin actions

Check:

  • Nextcloud security warnings
  • Linux auth logs
  • Docker errors
  • Reverse proxy logs

Example:

sudo journalctl -p 3 -xb
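
For the failed-login review, SSH log lines can be summarised per source address. This is a minimal sketch that only counts "Failed password" entries; the function name is illustrative, and hosts with password authentication disabled will log other messages that it does not count.

```shell
#!/bin/sh
# Read sshd log lines on stdin and print "count ip" for failed password attempts, busiest source first.
failed_logins_by_ip() {
    awk '/Failed password/ {
             for (i = 1; i < NF; i++)
                 if ($i == "from") { count[$(i + 1)]++; break }
         }
         END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Example (assumes the Debian/Ubuntu unit name "ssh"; other distributions use "sshd"):
#   sudo journalctl -u ssh --since yesterday | failed_logins_by_ip
```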

Escalation Triggers

Immediate escalation if:

  • Multiple failed admin logins
  • MFA bypass suspicion
  • Unknown admin account
  • Malware/ransomware indicators
  • Unexpected outbound traffic

Deliverable

  • Security review noted in operational log

3.3 Backup Verification

Frequency

Daily

Estimated Time

5-10 minutes

Objective

Verify backups completed successfully.

Tasks

Verify:

  • Scheduled backup job completed
  • Backup storage reachable
  • Backup size reasonable
  • No corruption warnings
  • Snapshot success in DigitalOcean

Validate:

  • Latest backup timestamp
  • Database dump presence
  • File archive generation
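
The timestamp and presence checks can be scripted. The sketch below assumes a hypothetical BACKUP_DIR containing one database dump and one file archive. The file names and the 26-hour window (a daily job plus some slack) are assumptions, not values from this environment.

```shell
#!/bin/sh
# Return 0 when the file exists, is non-empty, and was modified within the last N hours.
backup_is_fresh() {
    file=$1
    max_age_hours=$2
    [ -s "$file" ] || return 1
    # find prints the path only if its modification time is newer than the cutoff.
    [ -n "$(find "$file" -mmin "-$((max_age_hours * 60))" 2>/dev/null)" ]
}

# BACKUP_DIR and the file names are placeholders; nothing runs unless BACKUP_DIR is set.
if [ -n "${BACKUP_DIR:-}" ]; then
    for f in "$BACKUP_DIR/nextcloud-db.sql.gz" "$BACKUP_DIR/nextcloud-files.tar.gz"; do
        if backup_is_fresh "$f" 26; then
            echo "OK: $f ($(du -h "$f" | cut -f1))"
        else
            echo "WARN: $f is missing, empty, or older than 26 hours"
        fi
    done
fi
```

A fresh, non-empty file only shows that the job ran. It does not show that the backup can be restored.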

Important

A backup that has not been validated should be treated as nonexistent.

Deliverable

  • Backup verification entry in operational log

3.4 User Administration Review

Frequency

Daily

Estimated Time

5-10 minutes

Objective

Ensure user/account integrity.

Tasks

Review:

  • New user requests
  • Disabled users
  • Terminated personnel
  • Permission changes
  • Shared folder permissions
  • Public links

Verify:

  • No orphaned admin accounts
  • MFA enabled for all admins
  • Least-privilege principles maintained
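
One way to catch orphaned or unknown admin accounts is to compare the current admin list against a hand-maintained list of approved administrators. This is a sketch only: the approved-list file and the function name are hypothetical.

```shell
#!/bin/sh
# Read current admin user IDs on stdin (one per line) and print any that are not
# listed in the approved file given as $1.
unexpected_admins() {
    grep -vxF -f "$1" || true
}

# Example (assumptions: jq is installed, the admin group is named "admin", occ prints
# a JSON object keyed by group name, and approved-admins.txt is maintained by hand):
#   docker exec -u www-data nextcloud-app php occ group:list --output=json \
#       | jq -r '.admin[]' | unexpected_admins approved-admins.txt
```

Any name printed should be treated as an escalation candidate until it is explained.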

High-Risk Areas

  • Shared folders with external access
  • Public upload links
  • Administrative delegation

Deliverable

  • Access review note

3.5 Incident Queue Review

Frequency

Daily

Estimated Time

5-15 minutes

Objective

Identify unresolved operational or security issues.

Tasks

Review:

  • User tickets
  • Error reports
  • Sync failures
  • Email delivery issues
  • Storage complaints
  • Permission problems

Deliverable

  • Updated incident tracking

4. Weekly Operational Tasks


4.1 Operating System Updates

Frequency

Weekly

Estimated Time

30-60 minutes

Objective

Maintain security posture and system stability.

Tasks

Linux Updates

sudo apt update
sudo apt upgrade
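
Before applying updates it can help to record how many packages are pending, for the patch log. A minimal sketch, assuming a Debian or Ubuntu host; the function name is illustrative.

```shell
#!/bin/sh
# Read `apt-get -s upgrade` output on stdin and print how many packages would be installed or upgraded.
count_pending() {
    grep -c '^Inst ' || true
}

if command -v apt-get >/dev/null 2>&1; then
    # -s simulates the upgrade, so nothing is changed and root is not required.
    pending=$(apt-get -s upgrade 2>/dev/null | count_pending)
    echo "Packages pending upgrade: $pending"
    # Debian and Ubuntu create this flag file when an installed update needs a reboot.
    if [ -f /var/run/reboot-required ]; then
        echo "Reboot required"
    fi
fi
```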

Docker

  • Update container images
  • Rebuild containers if necessary
  • Remove unused images
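
The container update steps might look like the following, assuming the stack is managed with Docker Compose (this runbook does not say how the containers are defined). The script prints the commands by default and only executes them when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumes the stack is defined in a Compose file in the current directory.
update_stack() {
    # Fetch newer images for every service.
    run docker compose pull &&
    # Recreate only containers whose image or configuration changed.
    run docker compose up -d &&
    # Remove dangling images left behind by the update.
    run docker image prune -f
}

update_stack
```

`docker image prune -f` removes only dangling images. Adding `-a` would also remove every image not used by a container, which makes rollback harder.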

Validate:

  • Nextcloud functionality after updates
  • Database connectivity
  • Reverse proxy operation

Important

Do NOT apply major-version upgrades during business hours.

Deliverable

  • Patch log
  • Change log entry

4.2 Nextcloud Maintenance Review

Frequency

Weekly

Estimated Time

20-30 minutes

Tasks

Review:

  • Security warnings
  • Integrity check results
  • App updates
  • Background jobs
  • Storage consumption

Validate:

  • Cron jobs functioning
  • File scanning healthy
  • No database corruption warnings

Execute

docker exec -u www-data -it nextcloud-app php occ status   # occ must run as the web server user (www-data in the official Nextcloud image)
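
Individual fields of the status output can be extracted for the weekly report. A small sketch, assuming the default plain-text output format of `occ status`; the helper name is illustrative.

```shell
#!/bin/sh
# Read `occ status` output on stdin and print the value of one field, e.g. "maintenance".
status_field() {
    awk -v key="$1:" '$1 == "-" && $2 == key { print $3 }'
}

# Example (container name taken from the command above; www-data is the web server
# user in the official Nextcloud image and is an assumption here):
#   docker exec -u www-data nextcloud-app php occ status | status_field maintenance
```

A value of true for maintenance or needsDbUpgrade after an update means the instance is not yet ready for users.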

Deliverable

  • Weekly maintenance report

4.3 Backup Restore Test

Frequency

Weekly

Estimated Time

30-60 minutes

Objective

Prove recoverability.

Tasks

Restore:

  • Single file
  • Database dump
  • User folder sample

Verify:

  • File integrity
  • Permissions
  • Recovery speed
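
File integrity after a restore can be confirmed by comparing checksums of the restored copy and a known-good reference. A minimal sketch; the function names are illustrative and the paths in the example are placeholders.

```shell
#!/bin/sh
# Print the SHA-256 digest of a file (digest only, without the file name).
digest() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Return 0 when both files exist and the restored file is byte-for-byte identical to the reference.
restore_matches() {
    [ -f "$1" ] && [ -f "$2" ] && [ "$(digest "$1")" = "$(digest "$2")" ]
}

# Example (placeholder paths):
#   restore_matches /srv/reference/report.pdf /tmp/restore-test/report.pdf && echo "restore OK"
```

Recording both digests in the restore validation report gives later reviewers something concrete to check.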

Critical Principle

Without regular restore tests there is no evidence that backups are actually recoverable, which substantially increases liability exposure.

Deliverable

  • Restore validation report

4.4 Security Audit Review

Frequency

Weekly

Estimated Time

30 minutes

Tasks

Review:

  • Admin accounts
  • Group memberships
  • External shares
  • Public links
  • Expired accounts
  • MFA compliance

Validate:

  • SSL certificate expiration dates
  • Firewall rules
  • SSH access
  • Root login disabled
  • Fail2Ban status (if implemented)
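
Two of these validations lend themselves to scripting: days remaining on the served certificate, and whether root SSH login is disabled. The sketch below assumes a hypothetical CHECK_HOST variable and GNU date; the function names are illustrative.

```shell
#!/bin/sh
# Print whole days between two Unix timestamps (expiry first, then "now").
days_left() {
    echo $(( ($1 - $2) / 86400 ))
}

# Read `sshd -T` output on stdin; return 0 only when root login is fully disabled.
root_login_disabled() {
    grep -qix 'permitrootlogin no'
}

# CHECK_HOST is a placeholder such as cloud.example.com; nothing runs unless it is set.
if [ -n "${CHECK_HOST:-}" ]; then
    # Fetch the served certificate and extract its notAfter date.
    end=$(echo | openssl s_client -connect "$CHECK_HOST:443" -servername "$CHECK_HOST" 2>/dev/null |
        openssl x509 -noout -enddate | cut -d= -f2)
    # `date -d` is GNU date syntax.
    echo "Certificate expires in $(days_left "$(date -d "$end" +%s)" "$(date +%s)") days"
fi

# Example for the SSH check (sshd -T prints the effective configuration and needs root):
#   sudo sshd -T | root_login_disabled && echo "root SSH login disabled"
```

A setting of prohibit-password still allows key-based root login, so the check above deliberately reports it as not disabled.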

Deliverable

  • Weekly security audit checklist

4.5 Capacity and Performance Review

Frequency

Weekly

Estimated Time

20-30 minutes

Tasks

Analyze:

  • Storage growth
  • User growth
  • Bandwidth usage
  • CPU/RAM trends
  • Database size growth

Evaluate:

  • Need for droplet resize
  • Need for archive policies
  • Need for retention changes
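
A rough projection helps decide when a resize or archive policy is needed. The sketch below assumes linear growth and reuses the 80% disk threshold from the daily health check; the function name and the sample figures are illustrative.

```shell
#!/bin/sh
# Print how many whole weeks remain before usage reaches 80% of the disk,
# assuming growth stays linear. Arguments: used GB, disk size GB, growth in GB per week.
weeks_until_threshold() {
    used=$1
    size=$2
    growth=$3
    limit=$(( size * 80 / 100 ))
    if [ "$used" -ge "$limit" ]; then
        echo 0
    elif [ "$growth" -le 0 ]; then
        echo "no growth"
    else
        echo $(( (limit - used) / growth ))
    fi
}

# 40 GB used on a 100 GB disk, growing 5 GB per week: (80 - 40) / 5
weeks_until_threshold 40 100 5   # prints 8
```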

Deliverable

  • Capacity trend notes

4.6 Documentation and Change Log

Frequency

Weekly

Estimated Time

15-20 minutes

Objective

Maintain defensible operational records.

Tasks

Document:

  • Changes made
  • Accounts added/removed
  • Incidents
  • Security events
  • Backup issues
  • Maintenance performed
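
A consistent, timestamped entry format makes the log easier to defend later. The helper below is one possible sketch: the OPS_LOG path and the pipe-separated layout are assumptions, not an established standard for this environment.

```shell
#!/bin/sh
# OPS_LOG is a placeholder path; the pipe-separated format is an assumption, not a standard.
OPS_LOG=${OPS_LOG:-./operations.log}

# Append one UTC-timestamped entry: operator, category, then free-text note.
log_entry() {
    printf '%s | %s | %s | %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$1" "$2" >> "$OPS_LOG"
}

# Example:
#   log_entry backup "Nightly backup verified, restore test passed"
```

Appending, rather than editing entries in place, keeps the file usable as an audit trail.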

Important

Operational documentation is part of liability protection.

If a breach occurs, documented operational diligence matters significantly.

Deliverable

  • Weekly operational summary

5. Monthly Administrative Tasks


5.1 Full Disaster Recovery Exercise

Estimated Time

2-4 hours

Tasks

Simulate:

  • Server loss
  • Container rebuild
  • Restore from backup
  • DNS validation
  • SSL restoration

5.2 User Access Certification

Estimated Time

30-60 minutes

Tasks

Review with client:

  • Active users
  • Admin privileges
  • External sharing
  • Terminated employees

5.3 Security Policy Review

Estimated Time

30 minutes

Tasks

Review:

  • MFA compliance
  • Password standards
  • Administrative access
  • Training completion

6. Estimated Operational Effort

Activity              Estimated Time
Daily Operations      35-60 min/day
Weekly Maintenance    2-4 hrs/week
Monthly DR/Security   3-6 hrs/month

7. Recommended Retainer Guidance

For a pilot of this size:

Service Level                        Estimated Monthly Hours
Minimal Reactive Support             8-10 hrs
Recommended Operational Support      15-20 hrs
Security-Conscious Managed Support   25-35 hrs

Given the recent discussions around:

  • liability
  • data protection
  • backup validation
  • MFA enforcement
  • user training
  • documented diligence

…the “Recommended Operational Support” tier is likely the minimum responsible posture.


8. Key Risk Areas to Monitor

The largest liability exposure areas are:

Administrative Misconfiguration

  • Incorrect sharing permissions
  • Public links
  • Excessive admin rights

Backup Failure

  • Silent backup corruption
  • Unverified restores

Credential Compromise

  • Weak passwords
  • MFA disabled
  • Phishing

Delayed Patching

  • Unpatched Nextcloud vulnerabilities
  • Docker/container CVEs
  • Linux exploits

User Behavior

  • Unsafe uploads
  • Credential reuse
  • Local machine compromise

Lack of Documentation

  • No operational evidence
  • No audit trail
  • Undefined responsibilities

9. Strong Recommendations

Require:

  • MFA for all users
  • Mandatory admin training
  • Signed acceptable use/security acknowledgment
  • Principle of least privilege
  • Centralized logging
  • Automated monitoring alerts
  • Offsite backups
  • Written incident response plan
  • Cyber liability / E&O insurance

Avoid:

  • Shared admin accounts
  • Permanent public links
  • Unrestricted upload folders
  • Direct root SSH access
  • Unmanaged personal devices for administrators