Nexus — Document Intelligence
AI-powered document processing pipeline for financial documents
Nexus is Sentinel’s AI-powered document intelligence engine. It transforms unstructured financial documents — Consolidated Account Statements (CAS), portfolio statements, account summaries, and other wealth management documents — into structured, queryable data through a fully automated 10-stage pipeline.
Upload a PDF. Nexus reads every page, classifies the document type, extracts layout and hierarchy, identifies financial entities, pulls out securities holdings with ISINs, NAVs, and market values, cross-validates the results, and delivers clean structured data ready for export. No manual data entry. No spreadsheet wrangling.
Pipeline Flow
The Nexus pipeline follows a linear request-response pattern. Upload a document, start processing, poll for progress, then fetch the structured extraction.
graph LR
Upload["Upload PDF<br/>POST /nexus/upload/"] --> Process["Start Pipeline<br/>POST /v2/nexus/pipeline/process"]
Process --> Poll["Poll Progress<br/>GET /v2/nexus/status/process/{id}/progress"]
Poll -->|"100%"| Extract["Fetch Extraction<br/>GET /v1/nexus/documents/{id}/extraction"]
Extract --> Export["Export Data<br/>GET /v1/nexus/export/{job_id}/fields"]
classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
classDef secondary fill:#e0e7ff,stroke:#6366f1,color:#1e293b
classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b
class Upload primary
class Process,Poll secondary
class Extract,Export success
Step-by-step:
- Upload: The user selects a PDF file. The frontend sends it as multipart form data to the upload endpoint. The backend returns a job_id.
- Process: The frontend triggers the V2 pipeline with the job_id. The backend returns both a job_id and a process_id (V2-specific). The process_id is stored in the frontend’s nexus store for progress tracking.
- Poll: The frontend polls the progress endpoint with the process_id at regular intervals. Each response includes the current stage name, percentage complete, and per-stage status.
- Extract: Once progress reaches 100%, the frontend fetches the full extraction payload. This contains all parsed securities, asset allocation, account metadata, and document classification.
- Export: Users can export the structured data as holdings or underlying securities with selectable fields.
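The first four steps can be sketched as a small client loop. This is a minimal illustration, not the actual frontend code: the `http` transport callable, the `percentage` response field, the polling limits, and the use of `job_id` as the document id for the extraction call are all assumptions.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport signature: (method, path, payload) -> parsed JSON dict.
Transport = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def run_pipeline(http: Transport, pdf_bytes: bytes,
                 poll_seconds: float = 2.0, max_polls: int = 300) -> Dict[str, Any]:
    """Upload a PDF, start the V2 pipeline, poll until 100%, return the extraction."""
    # 1. Upload: the backend returns a job_id.
    job_id = http("POST", "/api/v1/nexus/upload/", {"file": pdf_bytes})["job_id"]

    # 2. Process: the V2 endpoint returns a job_id and a process_id.
    started = http("POST", "/api/v2/nexus/pipeline/process", {"job_id": job_id})
    process_id = started["process_id"]

    # 3. Poll: "percentage" is an assumed name for the percentage-complete field.
    for _ in range(max_polls):
        progress = http("GET", f"/api/v2/nexus/status/process/{process_id}/progress", {})
        if progress["percentage"] >= 100:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"process {process_id} did not reach 100%")

    # 4. Extract: assumes the document id equals the job_id.
    return http("GET", f"/api/v1/nexus/documents/{job_id}/extraction", {})
```

Passing the transport in as a parameter keeps the flow testable without a running backend; a real client would wrap its HTTP library of choice behind the same signature.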
Document Lifecycle
Documents transition through a series of states from initial upload to final export. The state diagram below shows every possible transition, including error recovery and manual review paths.
stateDiagram-v2
[*] --> Uploaded: POST /nexus/upload
Uploaded --> Processing: POST /v2/nexus/pipeline/process
Processing --> Completed: All 10 stages pass
Processing --> Failed: Stage error
Completed --> Exported: Export to Excel/JSON
Failed --> Processing: Retry
Completed --> NeedsReview: Low confidence
NeedsReview --> Completed: Manual review
A document enters NeedsReview when the validate_insights stage produces confidence scores below the threshold. After a user reviews and confirms the extraction, it moves back to Completed and becomes eligible for export.
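The lifecycle can be expressed as a transition table. The sketch below copies the states and arrows from the diagram; the `DocState` enum and the `transition` helper are illustrative names, not part of the Nexus codebase.

```python
from enum import Enum


class DocState(Enum):
    UPLOADED = "Uploaded"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"
    NEEDS_REVIEW = "NeedsReview"
    EXPORTED = "Exported"


# Allowed transitions, copied from the state diagram.
TRANSITIONS = {
    DocState.UPLOADED: {DocState.PROCESSING},
    DocState.PROCESSING: {DocState.COMPLETED, DocState.FAILED},
    DocState.COMPLETED: {DocState.EXPORTED, DocState.NEEDS_REVIEW},
    DocState.FAILED: {DocState.PROCESSING},       # retry
    DocState.NEEDS_REVIEW: {DocState.COMPLETED},  # manual review
    DocState.EXPORTED: set(),
}


def transition(current: DocState, target: DocState) -> DocState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

With this table, a document in NeedsReview cannot be exported directly: it must return to Completed first, which matches the review rule described above.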
The 10 Pipeline Stages
Every document passes through all 10 stages sequentially. The frontend displays an animated timeline showing real-time progress through each stage.
| # | Stage | Description |
|---|---|---|
| 1 | load_split | Load the uploaded PDF and split it into individual pages. Each page is prepared for independent analysis. |
| 2 | classify | Determine the document type — CAS (Consolidated Account Statement), account statement, portfolio report, capital gains statement, etc. Classification drives downstream extraction logic. |
| 3 | layout_extract | Extract the visual layout structure from each page. Identifies tables, headers, paragraphs, and spatial relationships between elements. |
| 4 | document_structure | Parse the document’s logical hierarchy. Maps sections, subsections, and content blocks into a structured tree. |
| 5 | entity_detect | Identify financial entities within the document — fund names, AMC (Asset Management Company) names, ISIN codes, folio numbers, account identifiers. |
| 6 | metadata_extract | Extract account-level metadata: account holder name, PAN, AMC details, statement dates, folio numbers, and contact information. |
| 7 | info_extract | The core extraction stage. Pulls out individual securities with their holdings data — scheme names, ISIN codes, units held, NAV (Net Asset Value), market value, invested value, and gain/loss figures. |
| 8 | validate_insights | Cross-validate extracted data for consistency. Checks that unit counts multiplied by NAV match market values. Flags discrepancies and generates confidence scores. |
| 9 | store | Persist the validated extraction results to the backend database. The document is now queryable and available for export. |
| 10 | complete | Pipeline finished. All stages have completed successfully. The document detail page is now fully populated. |
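The stage 8 check that units multiplied by NAV match the market value can be sketched as follows. The 0.1% relative tolerance and the returned field names are assumptions. The source does not say how confidence scores are derived from discrepancies, so this sketch only flags them.

```python
def cross_check(units: float, nav: float, market_value: float,
                rel_tolerance: float = 0.001) -> dict:
    """Compare units * nav with the reported market value."""
    expected = units * nav
    # Relative deviation; the max() guards against a zero market value.
    deviation = abs(expected - market_value) / max(abs(market_value), 1e-9)
    return {
        "expected_value": round(expected, 2),
        "deviation": deviation,
        "consistent": deviation <= rel_tolerance,
    }
```

A relative tolerance is used rather than an exact comparison because statements typically round NAV and units to a few decimal places, so the product rarely matches to the last digit.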
Pipeline Stage Detail
The following diagram visualizes the data each stage produces as it flows through the pipeline. Early stages analyze structure; middle stages extract financial entities and holdings; final stages validate, persist, and mark completion.
graph TD
S1["1. Load Split<br/>Pages separated"] --> S2["2. Classify<br/>Document type identified"]
S2 --> S3["3. Layout Extract<br/>Visual structure mapped"]
S3 --> S4["4. Document Structure<br/>Hierarchy detected"]
S4 --> S5["5. Entity Detect<br/>Names, PANs, accounts"]
S5 --> S6["6. Metadata Extract<br/>AMC, folio, dates"]
S6 --> S7["7. Info Extract<br/>Securities, NAVs, values"]
S7 --> S8["8. Validate Insights<br/>Cross-checks applied"]
S8 --> S9["9. Store<br/>Data persisted"]
S9 --> S10["10. Complete<br/>Ready for export"]
classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
classDef highlight fill:#fae8ff,stroke:#a855f7,color:#1e293b
classDef warning fill:#fef3c7,stroke:#f59e0b,color:#1e293b
classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b
class S1,S2,S3,S4 primary
class S5,S6,S7 highlight
class S8 warning
class S9,S10 success
Stages 1-4 (blue) handle document ingestion and structural analysis. Stages 5-7 (purple) perform entity and financial data extraction. Stage 8 (amber) applies validation rules. Stages 9-10 (green) persist results and finalize the pipeline.
Extraction Data
Once the pipeline completes, Nexus delivers structured data across four categories.
Securities Holdings
Each security extracted from the document includes:
| Field | Description | Example |
|---|---|---|
| scheme_name | Full name of the mutual fund or security | SBI Bluechip Fund - Direct Plan - Growth |
| isin | International Securities Identification Number | INF200K01RJ1 |
| units | Number of units held | 245.678 |
| nav | Net Asset Value per unit | 78.45 |
| market_value | Current market value (units x NAV) | 19,268.41 |
| invested_value | Total amount originally invested | 15,000.00 |
| gain_loss | Absolute gain or loss | 4,268.41 |
| gain_loss_pct | Percentage return | 28.46% |
| folio_number | Folio number for the holding | 1234567890 |
| amc | Asset Management Company | SBI Funds Management |
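The gain and loss fields follow from the market and invested values. The sketch below models a holding as a dataclass and derives both figures. The `Holding` class is illustrative, and rounding to two decimals is an assumption based on the example row.

```python
from dataclasses import dataclass


@dataclass
class Holding:
    scheme_name: str
    isin: str
    units: float
    nav: float
    market_value: float
    invested_value: float

    @property
    def gain_loss(self) -> float:
        """Absolute gain or loss: market value minus invested value."""
        return round(self.market_value - self.invested_value, 2)

    @property
    def gain_loss_pct(self) -> float:
        """Percentage return; assumes invested_value is non-zero."""
        return round(100 * (self.market_value - self.invested_value)
                     / self.invested_value, 2)


example = Holding("SBI Bluechip Fund - Direct Plan - Growth", "INF200K01RJ1",
                  245.678, 78.45, 19268.41, 15000.00)
print(example.gain_loss, example.gain_loss_pct)  # 4268.41 28.46
```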
Asset Allocation
Asset-class breakdown of the portfolio:
- Equity percentage
- Debt percentage
- Hybrid/balanced percentage
- Market-cap allocation within equity (Large Cap, Mid Cap, Small Cap, Multi Cap)
Account Metadata
Document-level information:
- account_holder_name — Name of the investor
- pan — PAN card number (masked in UI)
- amc — Asset Management Company name
- folio_numbers — All folio numbers found in the document
- statement_period — Date range the statement covers
- statement_date — Date the statement was generated
Document Classification
- product_type — Broad category (Mutual Fund, Insurance, PMS, AIF)
- document_subtype — Specific type (CAS, Account Statement, Capital Gains Statement, Transaction Statement)
- confidence_score — Classification confidence (0.0 to 1.0)
AIF Extraction Support
Nexus provides comprehensive extraction support for Alternative Investment Fund (AIF) documents with flexible schema handling to accommodate multiple fund house formats.
Supported AIF Formats
AIF statements vary widely in structure across different Asset Management Companies (AMCs). Nexus handles 5+ format variations through a format-agnostic architecture:
| Format | Description | Example AMCs |
|---|---|---|
| AIFExtraction | Portfolio-level extraction with portfolio companies, NAV, and performance metrics (TVPI, DPI, RVPI, MOIC, XIRR) | Generic AIF funds |
| AIFStatementExtraction | New NEXUS format with normalized account info, holdings summary, and transaction history. Distinguished by transaction_count field | Motilal Oswal, Axis (new format) |
| AIFAccountStatementExtraction | Flexible wrapper format with AccountStatement key containing AMC-specific fields | Axis RERA, Motilal Oswal (legacy) |
| AIFFlatStatementExtraction | Flat snake_case format without wrappers. Covers Format C (account_holder), Format D (personal_information), and future variants | Various AMCs |
| NexusAifSoaExtraction | Generic AIF Statement of Account (SOA) format with standardized investor details and holdings summary | Generic SOA documents |
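The marker keys named in the table suggest a simple detection routine. The sketch below is an assumption about how such detection could work: it treats the markers as top-level keys and checks them in a fixed order. The table gives no marker for AIFExtraction or NexusAifSoaExtraction, so the function returns None for those.

```python
from typing import Optional


def detect_aif_format(payload: dict) -> Optional[str]:
    """Guess the AIF format from the marker keys named in the format table."""
    if "transaction_count" in payload:
        return "AIFStatementExtraction"
    if "AccountStatement" in payload:
        return "AIFAccountStatementExtraction"
    # Format C uses account_holder, Format D uses personal_information.
    if "account_holder" in payload or "personal_information" in payload:
        return "AIFFlatStatementExtraction"
    # No distinguishing marker is documented for the remaining formats.
    return None
```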
Portfolio Companies Data
For AIF documents with underlying portfolio companies, Nexus extracts:
| Field | Description | Example |
|---|---|---|
| name | Portfolio company name | TechCorp Private Ltd |
| sector | Business sector | Technology |
| investment_date | Initial investment date | 2023-06-15 |
| invested_amount | Total capital invested | 50,00,000 |
| current_value | Current NAV of investment | 72,00,000 |
| moic | Multiple on Invested Capital | 1.44x |
| irr | Internal Rate of Return | 18.5% |
| status | Investment status | Active / Exited / Partial Exit |
The Portfolio Companies Table in the UI provides:
- Sortable columns (by name, sector, invested amount, current value, MOIC, IRR)
- Search functionality across company names and sectors
- Aggregated portfolio statistics (total companies, total invested, total NAV)
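The MOIC column and the aggregated statistics can be reproduced from the extracted rows. This sketch assumes MOIC is current value divided by invested amount, which matches the example row (72,00,000 / 50,00,000 = 1.44). The function and key names in the result are illustrative.

```python
def moic(invested_amount: float, current_value: float) -> float:
    """Multiple on Invested Capital: current value over invested amount."""
    return current_value / invested_amount


def portfolio_stats(companies: list) -> dict:
    """Aggregate the three statistics listed for the Portfolio Companies Table."""
    return {
        "total_companies": len(companies),
        "total_invested": sum(c["invested_amount"] for c in companies),
        "total_nav": sum(c["current_value"] for c in companies),
    }


rows = [
    {"name": "TechCorp Private Ltd", "invested_amount": 5_000_000, "current_value": 7_200_000},
]
print(moic(5_000_000, 7_200_000))  # 1.44
print(portfolio_stats(rows))
```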
AIF Transaction History
Transaction-level data extracted from AIF statements:
| Field | Description |
|---|---|
| date | Transaction date |
| description | Transaction description |
| transaction_type | Capital call / Distribution / NAV update |
| units | Units affected |
| nav | NAV per unit at transaction |
| amount | Transaction amount |
| charges | Associated fees/charges |
AIF Account Information
Investor details extracted from AIF statements:
- Account Statement: Folio number, scheme name, statement period, PAN, advisor details
- Investor Details: Primary holder, joint holders (1-3), guardian, nominees (with allocation percentages)
- Holdings Summary: Total commitment, capital contribution, uncalled commitment, units allotted, current NAV, valuation
- Capital Summary: Total commitment, called capital, uncalled capital, distributed amount
- Performance Metrics: TVPI (Total Value to Paid-In), DPI (Distributions to Paid-In), RVPI (Residual Value to Paid-In), MOIC, XIRR
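The three paid-in multiples are related by TVPI = DPI + RVPI. The sketch below computes them from the capital summary figures, assuming that called capital is the paid-in amount and that the current valuation is the residual value. Nexus extracts these metrics from the statement; whether it also recomputes them is not stated in the source.

```python
def performance_multiples(paid_in: float, distributed: float,
                          residual_value: float) -> dict:
    """Standard private-fund multiples relative to paid-in capital."""
    dpi = distributed / paid_in       # Distributions to Paid-In
    rvpi = residual_value / paid_in   # Residual Value to Paid-In
    return {"dpi": dpi, "rvpi": rvpi, "tvpi": dpi + rvpi}
```

For example, with 100 paid in, 30 distributed, and 90 of residual value, DPI is 0.3, RVPI is 0.9, and TVPI is 1.2.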
Format Normalization
Nexus automatically normalizes different AIF formats to a unified structure for consistent rendering:
Field Name Mapping:
- account_info → account_statement
- holding_summary (singular) → holdings_summary (plural)
- capital_commitment → commitment_amount
- capital_contribution → contribution_received_amount
- units_allocated → units_allotted
- unit_class → class_name
- held → is_held
Structural Transformations:
- Individual holder objects (holder_1, holder_2, holder_3) → investor_details with typed fields
- Individual nominee objects (nominee_1, nominee_2, nominee_3) → nominees array
- Nested transaction structures → flattened transaction arrays
This normalization layer allows new AIF formats to be added without breaking existing UI components.
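A minimal sketch of such a normalization pass is shown below. It covers the field renames and the nominee folding only. Applying renames at every nesting depth is an assumption, and the holder-to-investor_details mapping is left out because the source does not specify its typed fields.

```python
FIELD_RENAMES = {
    "account_info": "account_statement",
    "holding_summary": "holdings_summary",
    "capital_commitment": "commitment_amount",
    "capital_contribution": "contribution_received_amount",
    "units_allocated": "units_allotted",
    "unit_class": "class_name",
    "held": "is_held",
}


def normalize(node):
    """Recursively rename keys and fold nominee_N objects into a nominees list."""
    if isinstance(node, list):
        return [normalize(item) for item in node]
    if not isinstance(node, dict):
        return node
    out, nominees = {}, []
    for key, value in node.items():
        suffix = key[len("nominee_"):] if key.startswith("nominee_") else ""
        if suffix.isdigit():
            # Remember the index so nominees keep their original order.
            nominees.append((int(suffix), normalize(value)))
        else:
            out[FIELD_RENAMES.get(key, key)] = normalize(value)
    if nominees:
        out["nominees"] = [v for _, v in sorted(nominees, key=lambda pair: pair[0])]
    return out
```

Keeping the renames in one dictionary means a new AMC format usually needs only new entries in the table rather than changes to the UI components that consume the normalized structure.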
Export System
Nexus supports two export modes, each producing a downloadable dataset with selectable fields.
Export Flow
The diagram below shows the two export paths. Both start from a completed extraction, but they produce different output granularities — portfolio-level holdings vs. the underlying constituent securities within each holding.
flowchart LR
subgraph Extraction["Completed Extraction"]
EX["Structured Data<br/>from Pipeline"]
end
EX --> FieldsAPI["GET /export/{job_id}/fields<br/>Fetch available fields + coverage"]
FieldsAPI --> HoldingsPath["Holdings Export"]
FieldsAPI --> UnderlyingPath["Underlying Export"]
subgraph HoldingsPath["Holdings Export"]
H1["Select fields:<br/>scheme, ISIN, units,<br/>NAV, market_value"]
H1 --> H2["Download<br/>1 row per holding"]
end
subgraph UnderlyingPath["Underlying Export"]
U1["Select fields:<br/>parent_scheme, sector,<br/>allocation_pct"]
U1 --> U2["Download<br/>1 row per constituent"]
end
classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
classDef secondary fill:#e0e7ff,stroke:#6366f1,color:#1e293b
classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b
classDef highlight fill:#fae8ff,stroke:#a855f7,color:#1e293b
class EX primary
class FieldsAPI secondary
class H1,H2 success
class U1,U2 highlight
Holdings Export
Exports portfolio-level holdings data. Each row is one security holding with its valuation.
Typical fields: scheme_name, isin, folio_number, units, nav, market_value, invested_value, gain_loss, gain_loss_pct, amc
Underlying Export
Exports the underlying securities within each holding. For fund-of-funds or multi-asset products, this breaks down the constituent securities.
Typical fields: parent_scheme, underlying_security, sector, allocation_pct, market_value
Field Selection
The export UI presents all available fields with coverage percentages. Coverage indicates what percentage of extracted records have a non-null value for that field. For example:
- scheme_name: 100% coverage
- isin: 98% coverage
- nav: 95% coverage
- gain_loss_pct: 82% coverage
Users select which fields to include before downloading.
Endpoint: GET /v1/nexus/export/{job_id}/fields returns available fields and coverage. The actual export download uses the selected field list as query parameters.
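Coverage as defined above (the share of records with a non-null value) can be computed as in the following sketch. The function name and the rounding to one decimal are assumptions; the real endpoint returns these figures precomputed.

```python
def field_coverage(records: list, fields: list) -> dict:
    """Percentage of records with a non-null value, per field."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in fields}
    return {
        # A missing key and an explicit None both count as null.
        name: round(100 * sum(r.get(name) is not None for r in records) / total, 1)
        for name in fields
    }


records = [
    {"isin": "X", "nav": 1.0},
    {"isin": None, "nav": 2.0},
    {"nav": 3.0},
    {"isin": "Y"},
]
print(field_coverage(records, ["isin", "nav"]))  # {'isin': 50.0, 'nav': 75.0}
```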
Token Usage and Costs
Nexus tracks LLM token usage at every pipeline stage. This gives full transparency into AI processing costs per document.
Each stage records:
| Metric | Description |
|---|---|
| input_tokens | Number of tokens sent to the model |
| output_tokens | Number of tokens generated by the model |
| model | Model identifier used for that stage (e.g., GPT-4o, Claude) |
| cost_usd | Cost in US dollars for that stage |
| cost_inr | Cost in Indian Rupees for that stage |
The Costs tab on the document detail page displays:
- Per-stage token breakdown in a table
- Total input and output tokens across all stages
- Aggregate cost in both USD and INR
- Model used at each stage (different stages may use different models for cost optimization)
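The totals on the Costs tab are sums over the per-stage records. The sketch below shows one way to aggregate them. The record shape follows the metrics table, while the function name, the rounding, and the sample figures in the usage lines are illustrative only.

```python
def total_costs(stages: list) -> dict:
    """Sum tokens and costs across per-stage usage records."""
    return {
        "input_tokens": sum(s["input_tokens"] for s in stages),
        "output_tokens": sum(s["output_tokens"] for s in stages),
        "cost_usd": round(sum(s["cost_usd"] for s in stages), 6),
        "cost_inr": round(sum(s["cost_inr"] for s in stages), 4),
        # Distinct models, since stages may use different ones.
        "models": sorted({s["model"] for s in stages}),
    }


# Made-up figures for two stages, for illustration only.
usage = [
    {"input_tokens": 1200, "output_tokens": 80, "model": "model-a",
     "cost_usd": 0.004, "cost_inr": 0.33},
    {"input_tokens": 9000, "output_tokens": 1500, "model": "model-b",
     "cost_usd": 0.06, "cost_inr": 4.98},
]
print(total_costs(usage))
```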
Document Detail Page
After processing, each document has a dedicated detail page with five tabs providing complete visibility into the extraction.
Tab 1: Pipeline
An animated vertical timeline showing all 10 stages. Each stage displays:
- Stage name and description
- Status indicator (completed, in-progress, pending, failed)
- Duration taken
- Real-time animation during active processing
The timeline updates live while the pipeline is running, giving users immediate feedback on progress.
Tab 2: Data
The core extraction view. Presents structured data in interactive tables:
- Securities table — Sortable, searchable list of all extracted holdings. Click any security row to expand and see full details including ISIN, folio, invested value, and gain/loss.
- Summary cards — Total portfolio value, total invested, overall gain/loss, number of securities found.
- Asset allocation chart — Visual breakdown by asset class.
Tab 3: Insights
AI-generated analysis of the extracted document:
- Document summary — Natural language overview of what was found
- Entity details — All detected entities (AMCs, account holders, folios) with their locations in the document
- Data quality notes — Any validation warnings or low-confidence extractions flagged during validate_insights
Tab 4: Costs
Token usage and cost breakdown (see Token Usage section above). Displays a table with one row per pipeline stage showing input tokens, output tokens, model, and cost.
Tab 5: Export
Field selection interface for downloading extracted data:
- Toggle between Holdings and Underlying export types
- Checkbox list of all available fields with coverage percentages
- Select All / Deselect All controls
- Download button generates the export file
Frontend Routes
| Route | Description |
|---|---|
| /nexus | Overview page. Shows upload action and recent document processing activity. |
| /nexus/upload | Upload interface. Drag-and-drop or file picker for PDF documents. Triggers pipeline on upload. |
| /nexus/documents | Document history list. All previously processed documents with status, date, and document type. Searchable and sortable. The Type column displays the AI-classified document type (e.g. Factsheet, Holdings, Snapshot) sourced from document_type. The list auto-refreshes every 30 seconds via X-Poll-Interval header. |
| /nexus/documents/[docId] | Document detail page with the 5-tab layout (Pipeline, Data, Insights, Costs, Export). |
API Endpoints Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/nexus/upload/ | Upload a PDF document (multipart form data) |
| POST | /api/v2/nexus/pipeline/process | Start the 10-stage pipeline. Returns job_id and process_id |
| GET | /api/v2/nexus/status/process/{process_id}/progress | Poll pipeline progress. Returns stage name, percentage, per-stage status |
| GET | /api/v1/documents/me | List documents belonging to the authenticated user. Polled every 30 s via X-Poll-Interval: 30 response header. |
| GET | /api/v1/nexus/documents/{doc_id}/extraction | Fetch full extraction results for a completed document |
| GET | /api/v1/nexus/export/{job_id}/fields | Get available export fields with coverage percentages |
| GET | /api/v1/nexus/doc-fetch/{job_id}?type=ORIGINAL | Download the original uploaded PDF |
| GET | /api/v1/nexus/stats | Admin-only aggregate statistics across all processed documents |
All endpoints route through the Studio BFF gateway (studio-backend-dev.centricitywealth.tech). The frontend never calls Nexus backend directly.
User-scoped document listing: As of v6.2, getDocuments() calls GET /documents/me instead of GET /documents/. This returns only the authenticated user’s documents rather than the full collection. Query parameters (limit, skip, product_type) are forwarded as a URL query string.
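Forwarding the three query parameters can be sketched as below. The helper name is hypothetical, and dropping unset parameters from the query string is an assumption about how getDocuments() behaves.

```python
from urllib.parse import urlencode


def documents_me_url(limit=None, skip=None, product_type=None) -> str:
    """Build the user-scoped listing URL, forwarding only parameters that are set."""
    params = {"limit": limit, "skip": skip, "product_type": product_type}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return "/api/v1/documents/me" + (f"?{query}" if query else "")


print(documents_me_url(limit=20, skip=40, product_type="AIF"))
# /api/v1/documents/me?limit=20&skip=40&product_type=AIF
```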
Document type column: As of v6.3, the Type column in the documents list displays document_type (the AI-classified document category such as factsheet, holdingstatement, portfoliosnapshot) instead of product_type. Each type renders with a distinct colour badge. If document_type is absent in the API response, the column shows -.