gaussian-wellworks/Runbooks/System Admin Runbook.md
2026-05-09 09:36:21 -05:00

# Fractional Insight CIO
## Operational Runbook — Daily & Weekly Administration
### DigitalOcean / Docker / Nextcloud / Fastmail Pilot Environment
---
# 1. Purpose
This operational runbook defines the recurring administrative tasks required to safely operate and maintain the client pilot environment hosted on DigitalOcean infrastructure using:
- Nextcloud
- Fastmail
- Docker containers
- Linux server administration
- Reverse proxy / SSL management
- Backup and recovery validation
- Security and compliance oversight
The goal of this runbook is to:
- Reduce operational risk
- Reduce exposure to liability
- Detect security incidents early
- Ensure recoverability of client data
- Maintain stable uptime and user access
- Establish evidence of reasonable administrative diligence
This document assumes:
- Remote users
- No internal IT staff
- Small pilot deployment
- Shared responsibility model between consultant and client
- MFA enforcement in both Fastmail and Nextcloud
---
# 2. Operational Philosophy
The environment should be treated as:
- A business collaboration platform
- A controlled data environment
- A security-sensitive system
- A system requiring documented administrative oversight
Because the platform contains:
- Client communications
- Potential confidential documents
- Shared file repositories
- User credentials
- Internet-exposed services
…administration must prioritize:
1. Security first
2. Recoverability second
3. Stability third
4. Convenience last
---
# 3. Daily Operational Tasks
---
## 3.1 Morning Health Check
### Frequency
Daily (business days)
### Estimated Time
10-15 minutes
### Objective
Confirm that all core systems are operational before users begin work.
### Tasks
#### Infrastructure
- Verify droplet online in DigitalOcean
- Verify CPU/RAM/disk usage within normal thresholds
- Verify disk utilization below 80%
- Verify Docker daemon operational
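The 80% disk threshold can be checked from the shell rather than read off a dashboard. The following is a minimal sketch, assuming a `df` that supports POSIX output (`-P`) and that `/` is the volume of interest; the function names are illustrative and not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Sketch: compare filesystem utilization against the runbook's 80% threshold.
# Function names and the default mount point are illustrative assumptions.

# Print the used-space percentage (integer, no % sign) for a mount point.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed (exit 0) when the given percentage is strictly below the limit.
below_threshold() {
  local pct="$1" limit="${2:-80}"
  [ "$pct" -lt "$limit" ]
}

# Print a one-line verdict suitable for the daily operational log.
check_disk() {
  local mount="${1:-/}" limit="${2:-80}" pct
  pct="$(disk_used_pct "$mount")"
  if below_threshold "$pct" "$limit"; then
    echo "OK: $mount at ${pct}% (limit ${limit}%)"
  else
    echo "WARN: $mount at ${pct}% (limit ${limit}%)"
    return 1
  fi
}
```

Running `check_disk / 80` prints a single OK or WARN line. On a systemd host, `systemctl is-active docker` is one way to confirm the Docker daemon alongside it.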
#### Services
- Verify Nextcloud web login functional
- Verify Fastmail operational status
- Verify SSL certificates valid
- Verify reverse proxy routing functional
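Certificate validity is easier to manage when the remaining lifetime is measured instead of waiting for a browser warning. The sketch below assumes `openssl` and GNU `date` are installed; the hostname passed in and the 21-day warning window are placeholders, not values taken from this environment.

```bash
#!/usr/bin/env bash
# Sketch: report how many days remain on a site's TLS certificate.
# Hostname and warning window are placeholders; requires openssl and GNU date.

# Print the certificate's notAfter date for host:443.
cert_not_after() {
  local host="$1"
  openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -enddate | cut -d= -f2
}

# Print whole days from a reference time (epoch seconds, default: now) until a date string.
days_until() {
  local target now
  target="$(date -u -d "$1" +%s)" || return 1
  now="${2:-$(date -u +%s)}"
  echo $(( (target - now) / 86400 ))
}

# Warn when fewer than min_days remain (default 21, an arbitrary choice).
check_cert() {
  local host="$1" min_days="${2:-21}" end left
  end="$(cert_not_after "$host")"
  [ -n "$end" ] || { echo "ERROR: no certificate read from $host"; return 2; }
  left="$(days_until "$end")" || return 2
  if [ "$left" -ge "$min_days" ]; then
    echo "OK: $host certificate valid for $left more days"
  else
    echo "WARN: $host certificate expires in $left days"
    return 1
  fi
}
```

A call such as `check_cert cloud.example.com` (hypothetical hostname) would be run once per public hostname served by the reverse proxy.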
#### Containers
Check:
- Nextcloud container
- Database container
- Redis container
- Reverse proxy container
Example:
```bash
docker ps
```
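Reading `docker ps` output by eye makes a missing container easy to overlook. As a rough sketch, the expected names can be compared against what is running. Only `nextcloud-app` appears elsewhere in this runbook; the other container names below are hypothetical and should be replaced with the real ones.

```bash
#!/usr/bin/env bash
# Sketch: flag expected containers that are not currently running.
# All names except nextcloud-app are hypothetical placeholders.

EXPECTED_CONTAINERS=(nextcloud-app nextcloud-db nextcloud-redis reverse-proxy)

# Read running container names on stdin (one per line);
# print each expected name (passed as arguments) that is absent.
missing_containers() {
  local running name
  running="$(cat)"
  for name in "$@"; do
    grep -qxF -- "$name" <<<"$running" || echo "$name"
  done
}

# Ask Docker for running container names and report any gaps.
check_containers() {
  local missing
  missing="$(docker ps --format '{{.Names}}' | missing_containers "${EXPECTED_CONTAINERS[@]}")"
  if [ -z "$missing" ]; then
    echo "OK: all expected containers running"
  else
    echo "WARN: not running: ${missing//$'\n'/ }"
    return 1
  fi
}
```

Because `docker ps` lists only running containers, a stopped or crashed container shows up here as missing; `docker ps -a` then shows its exit status.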
#### External Access Test
Validate:
- HTTPS access
- File upload/download
- Login functionality
#### Email
Send/receive test email through Fastmail test account.
### Deliverable
- Daily operational log entry
---
## 3.2 Security Event Review
### Frequency
Daily
### Estimated Time
10 minutes
### Objective
Identify suspicious activity before escalation.
### Tasks
#### Review:
- Failed login attempts
- MFA failures
- New device logins
- Suspicious IP addresses
- Excessive upload activity
- Unexpected admin actions
#### Check:
- Nextcloud security warnings
- Linux auth logs
- Docker errors
- Reverse proxy logs
Example:
```bash
# Messages of priority "err" (3) or more severe from the current boot, with explanations
sudo journalctl -p 3 -xb
```
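Failed logins are easier to judge when grouped by source address. The following sketch counts failed SSH password attempts per IP from log lines supplied on standard input. It assumes the usual OpenSSH message wording (`Failed password for ... from <ip> port ...`), and the function name is illustrative.

```bash
#!/usr/bin/env bash
# Sketch: summarize failed SSH password attempts by source IP.
# Assumes standard OpenSSH "Failed password ... from <ip> port <n>" log lines.

# Read log lines on stdin; print "count ip", busiest source first.
failed_ssh_by_ip() {
  awk '/Failed password/ {
         # Scan from the right so a username of "from" cannot confuse the match.
         for (i = NF - 1; i >= 1; i--)
           if ($i == "from") { print $(i + 1); break }
       }' |
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example invocations (the unit name "ssh" is typical on Debian/Ubuntu; this is an assumption):
#   sudo journalctl -u ssh --since today --no-pager | failed_ssh_by_ip
#   sudo cat /var/log/auth.log | failed_ssh_by_ip
```

A handful of attempts from scattered addresses is routine internet noise; repeated attempts against admin or root accounts correspond to the escalation triggers listed in this section.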
### Escalation Triggers
Immediate escalation if:
- Multiple failed admin logins
- MFA bypass suspicion
- Unknown admin account
- Malware/ransomware indicators
- Unexpected outbound traffic
### Deliverable
- Security review noted in operational log
---
## 3.3 Backup Verification
### Frequency
Daily
### Estimated Time
5-10 minutes
### Objective
Verify backups completed successfully.
### Tasks
#### Verify:
- Scheduled backup job completed
- Backup storage reachable
- Backup size reasonable
- No corruption warnings
- Snapshot success in DigitalOcean
#### Validate:
- Latest backup timestamp
- Database dump presence
- File archive generation
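The timestamp and size checks above lend themselves to a small script. This is a sketch only: the backup directory, file suffix, 26-hour age limit, and minimum size are hypothetical values, and it relies on GNU `stat`. It confirms that an artifact exists and looks plausible; it does not prove the backup can be restored.

```bash
#!/usr/bin/env bash
# Sketch: confirm the newest backup artifact is recent and not suspiciously small.
# Paths and limits are hypothetical; requires GNU stat.

# Print the most recently modified file in dir whose name ends with suffix.
latest_backup() {
  local dir="$1" suffix="$2"
  ls -1t -- "$dir"/*"$suffix" 2>/dev/null | head -n 1
}

# Succeed when the file exists, was modified within max_age_hours,
# and is at least min_bytes long.
backup_looks_valid() {
  local file="$1" max_age_hours="${2:-26}" min_bytes="${3:-1}"
  local now mtime size
  [ -f "$file" ] || { echo "FAIL: $file missing"; return 1; }
  now="$(date +%s)"
  mtime="$(stat -c %Y "$file")"
  size="$(stat -c %s "$file")"
  if [ $(( now - mtime )) -gt $(( max_age_hours * 3600 )) ]; then
    echo "FAIL: $file older than ${max_age_hours}h"
    return 1
  fi
  if [ "$size" -lt "$min_bytes" ]; then
    echo "FAIL: $file is only $size bytes"
    return 1
  fi
  echo "OK: $file ($size bytes)"
}
```

For example, `backup_looks_valid "$(latest_backup /srv/backups .sql.gz)" 26 1024` would check the newest database dump in a hypothetical `/srv/backups` directory.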
### Important
A backup that has not been validated should be treated as nonexistent.
### Deliverable
- Backup verification entry in operational log
---
## 3.4 User Administration Review
### Frequency
Daily
### Estimated Time
5-10 minutes
### Objective
Ensure user/account integrity.
### Tasks
#### Review:
- New user requests
- Disabled users
- Terminated personnel
- Permission changes
- Shared folder permissions
- Public links
#### Verify:
- No orphaned admin accounts
- MFA enabled for all admins
- Least-privilege principles maintained
### High-Risk Areas
- Shared folders with external access
- Public upload links
- Administrative delegation
### Deliverable
- Access review note
---
## 3.5 Incident Queue Review
### Frequency
Daily
### Estimated Time
5-15 minutes
### Objective
Identify unresolved operational or security issues.
### Tasks
Review:
- User tickets
- Error reports
- Sync failures
- Email delivery issues
- Storage complaints
- Permission problems
### Deliverable
- Updated incident tracking
---
# 4. Weekly Operational Tasks
---
## 4.1 Operating System Updates
### Frequency
Weekly
### Estimated Time
30-60 minutes
### Objective
Maintain security posture and system stability.
### Tasks
#### Linux Updates
```bash
sudo apt update
sudo apt upgrade
```
#### Docker
- Update container images
- Rebuild containers if necessary
- Remove unused images
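If the containers are managed with Docker Compose v2 (an assumption; this runbook does not say how the stack is defined), the three Docker steps can be sketched as one function. The project directory is a hypothetical path, and `DRY_RUN=1` prints the commands instead of running them so the sequence can be reviewed first.

```bash
#!/usr/bin/env bash
# Sketch: refresh images for a Compose-managed stack, then prune.
# Assumes Docker Compose v2 and a hypothetical project directory.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pull newer images, recreate containers whose image or config changed,
# then remove dangling (untagged) images. Stops at the first failure.
update_stack() {
  local project_dir="$1"
  run docker compose --project-directory "$project_dir" pull &&
    run docker compose --project-directory "$project_dir" up -d &&
    run docker image prune -f
}
```

`DRY_RUN=1 update_stack /opt/nextcloud` previews the commands for a hypothetical `/opt/nextcloud` project. `docker image prune -f` removes only dangling images; adding `-a` would also remove every image not used by a container, which is more aggressive.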
#### Validate:
- Nextcloud functionality after updates
- Database connectivity
- Reverse proxy operation
### Important
Do NOT apply major-version upgrades during business hours.
### Deliverable
- Patch log
- Change log entry
---
## 4.2 Nextcloud Maintenance Review
### Frequency
Weekly
### Estimated Time
20-30 minutes
### Tasks
#### Review:
- Security warnings
- Integrity check results
- App updates
- Background jobs
- Storage consumption
#### Validate:
- Cron jobs functioning
- File scanning healthy
- No database corruption warnings
#### Execute
```bash
# occ must run as the web server user (www-data in the official Nextcloud image)
docker exec -it -u www-data nextcloud-app php occ status
```
### Deliverable
- Weekly maintenance report
---
## 4.3 Backup Restore Test
### Frequency
Weekly
### Estimated Time
30-60 minutes
### Objective
Prove recoverability.
### Tasks
Restore:
- Single file
- Database dump
- User folder sample
#### Verify:
- File integrity
- Permissions
- Recovery speed
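File integrity and recovery speed can both be recorded with very little tooling. The sketch below compares SHA-256 digests of an original and a restored file and times an arbitrary restore command. It assumes GNU coreutils `sha256sum`; the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: verify a restored file against its original and time the restore.
# Assumes GNU coreutils sha256sum; function names are illustrative.

# Succeed when both files exist and have identical SHA-256 digests.
same_checksum() {
  local a b
  a="$(sha256sum -- "$1" | cut -d' ' -f1)"
  b="$(sha256sum -- "$2" | cut -d' ' -f1)"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Run any command, print the elapsed whole seconds on stderr,
# and pass through the command's exit status.
timed() {
  local start=$SECONDS rc
  "$@"
  rc=$?
  echo "elapsed: $(( SECONDS - start ))s" >&2
  return "$rc"
}
```

A restore step can be wrapped as `timed <restore command>` and followed by `same_checksum original restored`, with both results copied into the restore validation report.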
### Critical Principle
If restore testing is not performed, liability exposure increases substantially.
### Deliverable
- Restore validation report
---
## 4.4 Security Audit Review
### Frequency
Weekly
### Estimated Time
30 minutes
### Tasks
#### Review:
- Admin accounts
- Group memberships
- External shares
- Public links
- Expired accounts
- MFA compliance
#### Validate:
- SSL certificate expiration dates
- Firewall rules
- SSH access
- Root login disabled
- Fail2Ban status (if implemented)
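The SSH items can be validated against the effective server configuration rather than the config file alone, since `sshd -T` prints the settings actually in force, with lowercase keywords. A minimal sketch, with illustrative function names:

```bash
#!/usr/bin/env bash
# Sketch: inspect effective sshd settings from `sshd -T` output on stdin.
# Function names are illustrative.

# Print the value of one sshd setting (keys are lowercase in `sshd -T` output).
sshd_setting() {
  awk -v key="$1" '$1 == key { print $2; exit }'
}

# Succeed only when root login is fully disabled.
# "prohibit-password" still permits key-based root login, so it fails this check.
root_login_disabled() {
  [ "$(sshd_setting permitrootlogin)" = "no" ]
}

# Example invocation:
#   sudo sshd -T | root_login_disabled && echo "OK: root SSH login disabled"
```

If Fail2Ban is in use, `sudo fail2ban-client status sshd` reports the jail's current state; the jail name `sshd` is the common default and is an assumption here.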
### Deliverable
- Weekly security audit checklist
---
## 4.5 Capacity and Performance Review
### Frequency
Weekly
### Estimated Time
20-30 minutes
### Tasks
#### Analyze:
- Storage growth
- User growth
- Bandwidth usage
- CPU/RAM trends
- Database size growth
#### Evaluate:
- Need for droplet resize
- Need for archive policies
- Need for retention changes
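Trend analysis needs a history to work from. One low-effort approach, sketched here with a hypothetical CSV location and illustrative function names, is to append one usage sample per day and compare the first and last samples at review time.

```bash
#!/usr/bin/env bash
# Sketch: keep a simple CSV history of disk usage and report growth.
# The CSV path and function names are hypothetical.

# Append "YYYY-MM-DD,used_kb" for a mount point to the CSV file.
record_usage() {
  local csv="$1" mount="${2:-/}" used
  used="$(df -Pk "$mount" | awk 'NR == 2 { print $3 }')"
  [ -n "$used" ] || return 1
  printf '%s,%s\n' "$(date +%F)" "$used" >>"$csv"
}

# Print growth in KB between the first and last sample in the CSV.
usage_growth_kb() {
  awk -F, 'NR == 1 { first = $2 }
           { last = $2 }
           END { if (NR) printf "%d\n", last - first }' "$1"
}
```

`record_usage /var/log/disk-usage.csv /` could run from a daily cron job, with `usage_growth_kb /var/log/disk-usage.csv` consulted during the weekly review; both paths are placeholders.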
### Deliverable
- Capacity trend notes
---
## 4.6 Documentation and Change Log
### Frequency
Weekly
### Estimated Time
15-20 minutes
### Objective
Maintain defensible operational records.
### Tasks
Document:
- Changes made
- Accounts added/removed
- Incidents
- Security events
- Backup issues
- Maintenance performed
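Records are more defensible when every entry carries a timestamp and an author. As a sketch, a small helper can append entries in a uniform format; the log path, category labels, and line layout are illustrative choices rather than an established standard.

```bash
#!/usr/bin/env bash
# Sketch: append uniformly formatted entries to an operational log.
# Log path, categories, and line layout are illustrative.

# Usage: log_entry <logfile> <category> <message...>
# Writes: 2026-05-09T14:36:21Z [user] category: message
log_entry() {
  local logfile="$1" category="$2"
  shift 2
  printf '%s [%s] %s: %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$category" "$*" >>"$logfile"
}
```

For example, `log_entry ~/ops-log.txt backup "Nightly dump verified"` adds one line. Timestamps are written in UTC so entries stay comparable across time zones.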
### Important
Operational documentation is part of liability protection.
If a breach occurs, documented operational diligence matters significantly.
### Deliverable
- Weekly operational summary
---
# 5. Monthly Administrative Tasks
---
## 5.1 Full Disaster Recovery Exercise
### Estimated Time
2-4 hours
### Tasks
Simulate:
- Server loss
- Container rebuild
- Restore from backup
- DNS validation
- SSL restoration
---
## 5.2 User Access Certification
### Estimated Time
30-60 minutes
### Tasks
Review with client:
- Active users
- Admin privileges
- External sharing
- Terminated employees
---
## 5.3 Security Policy Review
### Estimated Time
30 minutes
### Tasks
Review:
- MFA compliance
- Password standards
- Administrative access
- Training completion
---
# 6. Estimated Operational Effort
| Activity | Estimated Time |
|---|---|
| Daily Operations | 35-60 min/day |
| Weekly Maintenance | 2-4 hrs/week |
| Monthly DR/Security | 3-6 hrs/month |
---
# 7. Recommended Retainer Guidance
For a pilot of this size:
| Service Level | Estimated Monthly Hours |
|---|---|
| Minimal Reactive Support | 8-10 hrs |
| Recommended Operational Support | 15-20 hrs |
| Security-Conscious Managed Support | 25-35 hrs |
Given the recent discussions around:
- liability
- data protection
- backup validation
- MFA enforcement
- user training
- documented diligence
…the “Recommended Operational Support” tier is likely the minimum responsible posture.
---
# 8. Key Risk Areas to Monitor
The largest liability exposure areas are:
## Administrative Misconfiguration
- Incorrect sharing permissions
- Public links
- Excessive admin rights
## Backup Failure
- Silent backup corruption
- Unverified restores
## Credential Compromise
- Weak passwords
- MFA disabled
- Phishing
## Delayed Patching
- Unpatched Nextcloud vulnerabilities
- Docker/container CVEs
- Linux exploits
## User Behavior
- Unsafe uploads
- Credential reuse
- Local machine compromise
## Lack of Documentation
- No operational evidence
- No audit trail
- Undefined responsibilities
---
# 9. Strong Recommendations
## Require:
- MFA for all users
- Mandatory admin training
- Signed acceptable use/security acknowledgment
- Principle of least privilege
## Strongly Recommended:
- Centralized logging
- Automated monitoring alerts
- Offsite backups
- Written incident response plan
- Cyber liability / E&O insurance
## Avoid:
- Shared admin accounts
- Permanent public links
- Unrestricted upload folders
- Direct root SSH access
- Unmanaged personal devices for administrators