MongoDB (DocumentDB)

Document store for unstructured and semi-structured data.


Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Application Layer                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │  Extraction  │  │   Chat API   │  │ Agent Engine │              │
│  │   Service    │  │              │  │              │              │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘              │
└─────────┼─────────────────┼─────────────────┼──────────────────────┘
          │ secondaryPreferred
          │                 │ primaryPreferred
          │                 │                 │ secondaryPreferred
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│              Amazon DocumentDB 5.0 — 3-Node Replica Set             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐             │
│  │   Primary   │◄──►│ Secondary 1 │◄──►│ Secondary 2 │             │
│  │  (writes)   │    │  (reads)    │    │  (reads)    │             │
│  └─────────────┘    └─────────────┘    └─────────────┘             │
│         ▲                                                           │
│    Daily Snapshots ──► S3 (cross-region copy)                       │
└─────────────────────────────────────────────────────────────────────┘

Use Cases

Collection     Purpose               Example Document
documents      Extraction results    CAS holdings, metadata
chat_sessions  Message history       Messages, timestamps, token usage
agents         Agent configurations  System prompts, tool lists, guardrails
workflows      Workflow definitions  DAG nodes, edges, parameters
users          User profiles         Email, role, tenant, preferences

Configuration

  • Engine: Amazon DocumentDB 5.0 (MongoDB-compatible)
  • Instance: db.r6g.xlarge (prod), db.t3.medium (dev)
  • Replication: 3-node cluster across AZs
  • Backup: Daily snapshots, 7-day retention
  • Encryption: AES-256 at rest, TLS in transit

Connection Patterns & Client Libraries

Primary Driver: pymongo — version >=4.6.0

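from pymongo import MongoClient

# Credentials are inline for illustration only; load them from Secrets Manager in practice.
# If the Amazon CA is not in the system trust store, also pass tlsCAFile=<path to CA bundle>.
client = MongoClient(
    "mongodb://user:pass@host1:27017,host2:27017,host3:27017/?"
    "replicaSet=rs0&readPreference=secondaryPreferred&tls=true"
    "&retryWrites=false",           # DocumentDB does not support retryable writes
    maxPoolSize=100,                # upper bound on pooled connections per server
    minPoolSize=10,                 # connections kept open per server
    maxIdleTimeMS=60000,            # close connections idle for more than 60 s
    serverSelectionTimeoutMS=5000,  # give up if no suitable node is found within 5 s
)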

Read Preferences by Service:

Service             Read Preference     Reason
Extraction Service  secondaryPreferred  Heavy read load, eventual consistency acceptable
Chat API            primaryPreferred    Needs recent writes for session continuity
Agent Engine        secondaryPreferred  Configs change rarely, so slightly stale reads are acceptable
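
One way to apply the per-service read preferences is to build each service's connection URI from a shared template. The following is a minimal sketch, not the project's actual configuration code: the service keys (extraction, chat_api, agent_engine) and the build_uri helper are hypothetical names introduced for illustration, while the rs0 replica set name and the tls option come from the connection example above.

```python
from urllib.parse import quote_plus, urlencode

# Read preference per service, copied from the table above.
# The dictionary keys are hypothetical service identifiers.
READ_PREFERENCES = {
    "extraction": "secondaryPreferred",
    "chat_api": "primaryPreferred",
    "agent_engine": "secondaryPreferred",
}


def build_uri(service, user, password, hosts, replica_set="rs0"):
    """Return a MongoDB URI carrying the read preference for `service`."""
    read_preference = READ_PREFERENCES[service]  # KeyError for unknown services
    query = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,
        "tls": "true",
    })
    # Percent-encode credentials so characters such as '@' do not break the URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"mongodb://{credentials}@{','.join(hosts)}/?{query}"


print(build_uri("chat_api", "user", "p@ss", ["host1:27017", "host2:27017"]))
```

Keeping the mapping in one place means a change to a service's read preference is a one-line edit rather than a search across connection strings.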

Connection Pooling:

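  • maxPoolSize=100 per application instance. pymongo applies this limit per server, so worst-case connections on each node grow with the number of application instances.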
  • Monitor connections.current vs. connections.available to avoid exhaustion.
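
The pool size interacts with the DatabaseConnections warning threshold (> 400 in prod) listed under Monitoring & Alerting. A rough capacity check, assuming every application instance fills its pool against the same node (a worst case, not a measured figure); both function names are hypothetical:

```python
def worst_case_connections(app_instances, max_pool_size=100):
    """Connections one node could see if every instance fills its pool."""
    return app_instances * max_pool_size


def instances_within_threshold(threshold=400, max_pool_size=100):
    """Largest instance count whose worst case stays at or below the threshold."""
    return threshold // max_pool_size


print(worst_case_connections(6))     # 600
print(instances_within_threshold())  # 4
```

Under these assumptions, scaling beyond four application instances can trip the warning even when each instance behaves correctly, so the pool size or the threshold may need revisiting as the fleet grows.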

Backup & Disaster Recovery

Strategy                       Frequency     Retention   RPO        RTO
Automated Snapshots            Daily         7 days      24 hours   < 30 minutes
Manual Snapshots               Pre-release   Indefinite  Zero       < 30 minutes
Cross-Region Snapshot Copy     Daily (prod)  7 days      24 hours   < 1 hour
Point-in-Time Recovery (PITR)  Continuous    7 days      5 minutes  < 30 minutes

DR Runbook:

  1. Identify latest valid snapshot in DR region.
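  2. Restore to a new DocumentDB cluster: aws docdb restore-db-cluster-from-snapshot, then add instances with aws docdb create-db-instance (a restored cluster starts with no instances). Use aws docdb restore-db-cluster-to-point-in-time only when recovering from PITR rather than from a snapshot.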
  3. Update application connection strings via Parameter Store / Secrets Manager.
  4. Verify replica set health with rs.status() before redirecting traffic.
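
Step 2 of the runbook restores from a snapshot, for which the AWS CLI subcommand is restore-db-cluster-from-snapshot, followed by create-db-instance to add compute to the new cluster. The sketch below only assembles the two command lines so they can be reviewed before anyone runs them. The identifiers (dr-cluster, the snapshot name, the region) and the restore_commands helper are hypothetical, and networking options such as subnet groups and security groups are deliberately left out.

```python
import shlex


def restore_commands(snapshot_id, cluster_id, region, instance_class="db.r6g.xlarge"):
    """Return the CLI invocations for a snapshot restore as argument lists."""
    restore = [
        "aws", "docdb", "restore-db-cluster-from-snapshot",
        "--region", region,
        "--db-cluster-identifier", cluster_id,
        "--snapshot-identifier", snapshot_id,
        "--engine", "docdb",
    ]
    # The restored cluster has storage but no instances; add the first one.
    create_instance = [
        "aws", "docdb", "create-db-instance",
        "--region", region,
        "--db-instance-identifier", f"{cluster_id}-1",
        "--db-instance-class", instance_class,
        "--engine", "docdb",
        "--db-cluster-identifier", cluster_id,
    ]
    return [restore, create_instance]


# Hypothetical identifiers, printed for review rather than executed.
for command in restore_commands("prod-daily-copy", "dr-cluster", "us-west-2"):
    print(shlex.join(command))
```

Returning argument lists rather than a shell string keeps quoting unambiguous if the commands are later passed to subprocess.run.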

Performance Tuning

Indexing Strategy:

Collection     Index                               Type      Notes
documents      { tenant_id: 1, created_at: -1 }    Compound  Supports tenant-scoped queries
chat_sessions  { session_id: 1 }                   Unique    Fast session lookups
chat_sessions  { user_id: 1, updated_at: -1 }      Compound  Recent chat list
agents         { agent_id: 1 }                     Unique    Primary key
workflows      { workflow_id: 1, version: -1 }     Compound  Versioned workflow fetch
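
The index table can be kept as data and applied at deploy time. This is a sketch under assumptions: ensure_indexes and INDEXES are hypothetical names, and db is expected to behave like a pymongo Database (subscriptable by collection name, with create_index on each collection). The block itself does not import pymongo, so it can be read and tested without a cluster.

```python
# Same values as pymongo.ASCENDING and pymongo.DESCENDING.
ASCENDING, DESCENDING = 1, -1

# (collection, key specification, extra create_index options), from the table above.
INDEXES = [
    ("documents", [("tenant_id", ASCENDING), ("created_at", DESCENDING)], {}),
    ("chat_sessions", [("session_id", ASCENDING)], {"unique": True}),
    ("chat_sessions", [("user_id", ASCENDING), ("updated_at", DESCENDING)], {}),
    ("agents", [("agent_id", ASCENDING)], {"unique": True}),
    ("workflows", [("workflow_id", ASCENDING), ("version", DESCENDING)], {}),
]


def ensure_indexes(db):
    """Create every index in INDEXES and return whatever the driver reports."""
    return [
        db[collection].create_index(keys, **options)
        for collection, keys, options in INDEXES
    ]
```

With a real client this would be called as ensure_indexes(client["<database name>"]). Creating an index that already exists with the same definition is a no-op in pymongo, so the function can run on every deploy.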

TTL Indexes:

  • chat_sessions messages: expire after 90 days (expireAfterSeconds: 7776000).
  • audit_events staging: expire after 30 days (expireAfterSeconds: 2592000).
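
The expireAfterSeconds values follow directly from the retention periods. A small sketch of the arithmetic and of how a TTL index could be created: ensure_ttl_index is a hypothetical helper, the collection argument is assumed to behave like a pymongo Collection, and the indexed field name is left as a parameter because the source does not say which date field drives expiry.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_seconds(days):
    """Convert a retention period in days to an expireAfterSeconds value."""
    return days * SECONDS_PER_DAY


def ensure_ttl_index(collection, date_field, days):
    """Create a TTL index on `date_field`, which must hold date values."""
    return collection.create_index(
        [(date_field, 1)],
        expireAfterSeconds=days_to_seconds(days),
    )


print(days_to_seconds(90))  # 7776000
print(days_to_seconds(30))  # 2592000
```

A TTL index removes whole documents once the indexed date is older than the limit, so it suits per-message or per-event documents rather than arrays embedded in a long-lived session document.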

Query Profiling:

  • Enable the profiler with a 100 ms threshold in dev/staging to catch unindexed queries.
  • Review slow logs weekly via CloudWatch Logs Insights.

Monitoring & Alerting

CloudWatch Metrics:

Metric               Threshold             Severity
CPUUtilization       > 80% for 5 min       Warning
CPUUtilization       > 95% for 2 min       Critical
FreeableMemory       < 10% of total        Critical
DatabaseConnections  > 400 (prod)          Warning
WriteLatency         p99 > 20 ms           Warning
ReadLatency          p99 > 10 ms           Warning
VolumeBytesUsed      > 80% of provisioned  Warning
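
The two CPUUtilization rows can be expressed as CloudWatch alarm definitions. The sketch below only builds the keyword arguments that boto3's put_metric_alarm accepts; it does not call AWS. The cluster identifier prod-docdb, the alarm naming scheme, and the cpu_alarm helper are hypothetical, and "for N min" is interpreted here as N consecutive one-minute periods, which is an assumption about the intended alarm behavior.

```python
def cpu_alarm(cluster_id, severity, threshold_pct, minutes):
    """Build put_metric_alarm keyword arguments for a CPUUtilization alarm."""
    return {
        "AlarmName": f"{cluster_id}-cpu-{severity.lower()}",
        "Namespace": "AWS/DocDB",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 60,                  # one-minute datapoints
        "EvaluationPeriods": minutes,  # consecutive breaching minutes required
        "Threshold": float(threshold_pct),
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Thresholds from the table above; "prod-docdb" is a placeholder identifier.
ALARMS = [
    cpu_alarm("prod-docdb", "Warning", 80, 5),
    cpu_alarm("prod-docdb", "Critical", 95, 2),
]

for alarm in ALARMS:
    print(alarm["AlarmName"], alarm["Threshold"], alarm["EvaluationPeriods"])
```

Each dictionary could then be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm), with notification actions added to suit the team's paging setup.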

Log Exports:

  • Audit logs → CloudWatch Logs → OpenSearch for SIEM correlation.
  • Profiler logs → weekly Athena query for index recommendations.