Nexus — Document Intelligence

Nexus is Sentinel’s AI-powered document intelligence engine. It transforms unstructured financial documents — Consolidated Account Statements (CAS), portfolio statements, account summaries, and other wealth management documents — into structured, queryable data through a fully automated 10-stage pipeline.

Upload a PDF. Nexus reads every page, classifies the document type, extracts layout and hierarchy, identifies financial entities, pulls out securities holdings with ISINs, NAVs, and market values, cross-validates the results, and delivers clean structured data ready for export. No manual data entry. No spreadsheet wrangling.

Pipeline Flow

The Nexus pipeline follows a linear request-response pattern. Upload a document, start processing, poll for progress, then fetch the structured extraction.

graph LR
    Upload["Upload PDF<br/>POST /v1/nexus/upload/"] --> Process["Start Pipeline<br/>POST /v2/nexus/pipeline/process"]
    Process --> Poll["Poll Progress<br/>GET /v2/nexus/status/process/{id}/progress"]
    Poll -->|"100%"| Extract["Fetch Extraction<br/>GET /v1/nexus/documents/{id}/extraction"]
    Extract --> Export["Export Data<br/>GET /v1/nexus/export/{job_id}/fields"]

    classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
    classDef secondary fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b

    class Upload primary
    class Process,Poll secondary
    class Extract,Export success

Step-by-step:

Upload — User selects a PDF file. The frontend sends it as multipart form data to the upload endpoint. The backend returns a job_id.
Process — The frontend triggers the V2 pipeline with the job_id. The backend returns both a job_id and a process_id (V2-specific). The process_id is stored in the frontend’s nexus store for progress tracking.
Poll — The frontend polls the progress endpoint using the process_id at regular intervals. Each response includes the current stage name, percentage complete, and per-stage status.
Extract — Once progress reaches 100%, the frontend fetches the full extraction payload. This contains all parsed securities, asset allocation, account metadata, and document classification.
Export — Users can export the structured data as holdings or underlying securities with selectable fields.
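The first four steps above can be sketched as a small client routine. This is a minimal illustration, not the frontend's actual code: the `http` transport callable, the `percentage` field name, the polling interval, and the use of `job_id` as the document id for the extraction call are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: (method, path, payload) -> parsed JSON response.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def run_pipeline(http: Transport, pdf_bytes: bytes,
                 poll_seconds: float = 2.0, max_polls: int = 300) -> Dict[str, Any]:
    """Upload a PDF, start the V2 pipeline, poll to 100%, then fetch the extraction."""
    # 1. Upload: the backend returns a job_id.
    job_id = http("POST", "/api/v1/nexus/upload/", {"file": pdf_bytes})["job_id"]

    # 2. Process: the V2 endpoint also returns a process_id for progress tracking.
    started = http("POST", "/api/v2/nexus/pipeline/process", {"job_id": job_id})
    process_id = started["process_id"]

    # 3. Poll: "percentage" is an assumed name for the completion figure.
    for _ in range(max_polls):
        progress = http("GET", f"/api/v2/nexus/status/process/{process_id}/progress", None)
        if progress["percentage"] >= 100:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"process {process_id} did not reach 100%")

    # 4. Extract: assumes the document id equals the job_id.
    return http("GET", f"/api/v1/nexus/documents/{job_id}/extraction", None)
```

Passing the transport in as a parameter keeps the flow testable without a network; a real client would wrap its HTTP library and authentication behind the same signature.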

Document Lifecycle

Documents transition through a series of states from initial upload to final export. The state diagram below shows every possible transition, including error recovery and manual review paths.

stateDiagram-v2
    [*] --> Uploaded: POST /v1/nexus/upload/
    Uploaded --> Processing: POST /v2/nexus/pipeline/process
    Processing --> Completed: All 10 stages pass
    Processing --> Failed: Stage error
    Completed --> Exported: Export to Excel/JSON
    Failed --> Processing: Retry
    Completed --> NeedsReview: Low confidence
    NeedsReview --> Completed: Manual review

A document enters NeedsReview when the validate_insights stage produces confidence scores below the threshold. After a user reviews and confirms the extraction, it moves back to Completed and becomes eligible for export.
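The lifecycle can be expressed as a transition table. The sketch below copies the states and edges from the diagram; the enum, the `TRANSITIONS` map, and the `transition` helper are hypothetical names for illustration and are not taken from the Nexus codebase.

```python
from enum import Enum


class DocState(Enum):
    UPLOADED = "Uploaded"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"
    NEEDS_REVIEW = "NeedsReview"
    EXPORTED = "Exported"


# Allowed transitions, copied from the state diagram.
TRANSITIONS = {
    DocState.UPLOADED: {DocState.PROCESSING},
    DocState.PROCESSING: {DocState.COMPLETED, DocState.FAILED},
    DocState.COMPLETED: {DocState.EXPORTED, DocState.NEEDS_REVIEW},
    DocState.FAILED: {DocState.PROCESSING},       # retry
    DocState.NEEDS_REVIEW: {DocState.COMPLETED},  # manual review
    DocState.EXPORTED: set(),                     # no outgoing edge in the diagram
}


def transition(current: DocState, target: DocState) -> DocState:
    """Return target if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the edges in one table makes it easy to reject moves the diagram does not allow, such as exporting a document that is still in NeedsReview.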

The 10 Pipeline Stages

Every document passes through all 10 stages sequentially. The frontend displays an animated timeline showing real-time progress through each stage.

#	Stage	Description
1	load_split	Load the uploaded PDF and split it into individual pages. Each page is prepared for independent analysis.
2	classify	Determine the document type — CAS (Consolidated Account Statement), account statement, portfolio report, capital gains statement, etc. Classification drives downstream extraction logic.
3	layout_extract	Extract the visual layout structure from each page. Identifies tables, headers, paragraphs, and spatial relationships between elements.
4	document_structure	Parse the document’s logical hierarchy. Maps sections, subsections, and content blocks into a structured tree.
5	entity_detect	Identify financial entities within the document — fund names, AMC (Asset Management Company) names, ISIN codes, folio numbers, account identifiers.
6	metadata_extract	Extract account-level metadata: account holder name, PAN, AMC details, statement dates, folio numbers, and contact information.
7	info_extract	The core extraction stage. Pulls out individual securities with their holdings data — scheme names, ISIN codes, units held, NAV (Net Asset Value), market value, invested value, and gain/loss figures.
8	validate_insights	Cross-validate extracted data for consistency. Checks that unit counts multiplied by NAV match market values. Flags discrepancies and generates confidence scores.
9	store	Persist the validated extraction results to the backend database. The document is now queryable and available for export.
10	complete	Pipeline finished. All stages have completed successfully. The document detail page is now fully populated.
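The consistency check described for stage 8 (units multiplied by NAV should match market value) could look roughly like this. It is a sketch under stated assumptions: the rounding tolerance, the function name, and the string-valued inputs are illustrative, and the real stage also produces confidence scores that are not modelled here.

```python
from decimal import Decimal
from typing import Dict, List

# Assumed tolerance: accept up to 0.01 of rounding difference per holding.
TOLERANCE = Decimal("0.01")


def find_value_mismatches(holdings: List[Dict[str, str]]) -> List[str]:
    """Return scheme names whose market_value differs from units * nav."""
    flagged = []
    for holding in holdings:
        expected = Decimal(holding["units"]) * Decimal(holding["nav"])
        if abs(expected - Decimal(holding["market_value"])) > TOLERANCE:
            flagged.append(holding["scheme_name"])
    return flagged


sample = [
    {"scheme_name": "Fund A", "units": "100", "nav": "12.50", "market_value": "1250.00"},
    {"scheme_name": "Fund B", "units": "10", "nav": "20.00", "market_value": "210.00"},
]
print(find_value_mismatches(sample))  # ['Fund B']
```

`Decimal` is used instead of floats so that currency comparisons are not disturbed by binary rounding.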

Pipeline Stage Detail

The following diagram visualizes the data each stage produces as it flows through the pipeline. Early stages analyze structure; middle stages extract financial entities and holdings; final stages validate, persist, and mark completion.

graph TD
    S1["1. Load Split<br/>Pages separated"] --> S2["2. Classify<br/>Document type identified"]
    S2 --> S3["3. Layout Extract<br/>Visual structure mapped"]
    S3 --> S4["4. Document Structure<br/>Hierarchy detected"]
    S4 --> S5["5. Entity Detect<br/>Funds, AMCs, ISINs, folios"]
    S5 --> S6["6. Metadata Extract<br/>Holder, PAN, dates"]
    S6 --> S7["7. Info Extract<br/>Securities, NAVs, values"]
    S7 --> S8["8. Validate Insights<br/>Cross-checks applied"]
    S8 --> S9["9. Store<br/>Data persisted"]
    S9 --> S10["10. Complete<br/>Ready for export"]

    classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
    classDef highlight fill:#fae8ff,stroke:#a855f7,color:#1e293b
    classDef warning fill:#fef3c7,stroke:#f59e0b,color:#1e293b
    classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b

    class S1,S2,S3,S4 primary
    class S5,S6,S7 highlight
    class S8 warning
    class S9,S10 success

Stages 1-4 (blue) handle document ingestion and structural analysis. Stages 5-7 (purple) perform entity and financial data extraction. Stage 8 (amber) applies validation rules. Stages 9-10 (green) persist results and finalize the pipeline.

Extraction Data

Once the pipeline completes, Nexus delivers structured data across four categories.

Securities Holdings

Each security extracted from the document includes:

Field	Description	Example
`scheme_name`	Full name of the mutual fund or security	SBI Bluechip Fund - Direct Plan - Growth
`isin`	International Securities Identification Number	INF200K01RJ1
`units`	Number of units held	245.678
`nav`	Net Asset Value per unit	78.45
`market_value`	Current market value (units x NAV)	19,273.44
`invested_value`	Total amount originally invested	15,000.00
`gain_loss`	Absolute gain or loss (market value minus invested value)	4,273.44
`gain_loss_pct`	Percentage return on invested value	28.49%
`folio_number`	Folio number for the holding	1234567890
`amc`	Asset Management Company	SBI Funds Management

Asset Allocation

Asset-class breakdown of the portfolio:

Equity percentage
Debt percentage
Hybrid/balanced percentage
Market-cap allocation within equity (Large Cap, Mid Cap, Small Cap, Multi Cap)

Account Metadata

Document-level information:

account_holder_name — Name of the investor
pan — PAN card number (masked in UI)
amc — Asset Management Company name
folio_numbers — All folio numbers found in the document
statement_period — Date range the statement covers
statement_date — Date the statement was generated

Document Classification

product_type — Broad category (Mutual Fund, Insurance, PMS, AIF)
document_subtype — Specific type (CAS, Account Statement, Capital Gains Statement, Transaction Statement)
confidence_score — Classification confidence (0.0 to 1.0)

AIF Extraction Support

Nexus provides comprehensive extraction support for Alternative Investment Fund (AIF) documents with flexible schema handling to accommodate multiple fund house formats.

Supported AIF Formats

AIF statements vary widely in structure across different Asset Management Companies (AMCs). Nexus handles these variations through the five extraction schemas below, one of which (AIFFlatStatementExtraction) covers several flat variants:

Format	Description	Example AMCs
AIFExtraction	Portfolio-level extraction with portfolio companies, NAV, and performance metrics (TVPI, DPI, RVPI, MOIC, XIRR)	Generic AIF funds
AIFStatementExtraction	Newer Nexus format with normalized account info, holdings summary, and transaction history. Distinguished by the `transaction_count` field	Motilal Oswal, Axis (new format)
AIFAccountStatementExtraction	Flexible wrapper format with `AccountStatement` key containing AMC-specific fields	Axis RERA, Motilal Oswal (legacy)
AIFFlatStatementExtraction	Flat snake_case format without wrappers. Covers Format C (account_holder), Format D (personal_information), and future variants	Various AMCs
NexusAifSoaExtraction	Generic AIF Statement of Account (SOA) format with standardized investor details and holdings summary	Generic SOA documents
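The distinguishing keys in the table suggest a simple way to tell formats apart. The following is only a sketch: the order of checks, the `portfolio_companies` key, and the fallback to the SOA format are assumptions, and the real detection logic may differ.

```python
from typing import Any, Dict


def detect_aif_format(payload: Dict[str, Any]) -> str:
    """Guess which AIF schema a raw extraction payload follows."""
    if "AccountStatement" in payload:
        # Wrapper format keyed by AccountStatement.
        return "AIFAccountStatementExtraction"
    if "transaction_count" in payload:
        # The newer format is distinguished by transaction_count.
        return "AIFStatementExtraction"
    if "account_holder" in payload or "personal_information" in payload:
        # Flat Format C (account_holder) or Format D (personal_information).
        return "AIFFlatStatementExtraction"
    if "portfolio_companies" in payload:  # assumed key name
        return "AIFExtraction"
    return "NexusAifSoaExtraction"  # assumed fallback


print(detect_aif_format({"transaction_count": 12}))  # AIFStatementExtraction
```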

Portfolio Companies Data

For AIF documents with underlying portfolio companies, Nexus extracts:

Field	Description	Example
`name`	Portfolio company name	TechCorp Private Ltd
`sector`	Business sector	Technology
`investment_date`	Initial investment date	2023-06-15
`invested_amount`	Total capital invested	50,00,000
`current_value`	Current NAV of investment	72,00,000
`moic`	Multiple on Invested Capital	1.44x
`irr`	Internal Rate of Return	18.5%
`status`	Investment status	Active / Exited / Partial Exit

The Portfolio Companies Table in the UI provides:

Sortable columns (by name, sector, invested amount, current value, MOIC, IRR)
Search functionality across company names and sectors
Aggregated portfolio statistics (total companies, total invested, total NAV)

AIF Transaction History

Transaction-level data extracted from AIF statements:

Field	Description
`date`	Transaction date
`description`	Transaction description
`transaction_type`	Capital call / Distribution / NAV update
`units`	Units affected
`nav`	NAV per unit at transaction
`amount`	Transaction amount
`charges`	Associated fees/charges

AIF Account Information

Investor details extracted from AIF statements:

Account Statement: Folio number, scheme name, statement period, PAN, advisor details
Investor Details: Primary holder, joint holders (1-3), guardian, nominees (with allocation percentages)
Holdings Summary: Total commitment, capital contribution, uncalled commitment, units allotted, current NAV, valuation
Capital Summary: Total commitment, called capital, uncalled capital, distributed amount
Performance Metrics: TVPI (Total Value to Paid-In), DPI (Distributions to Paid-In), RVPI (Residual Value to Paid-In), MOIC, XIRR
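The three paid-in multiples follow standard private-fund definitions: DPI is distributions divided by paid-in capital, RVPI is residual value divided by paid-in capital, and TVPI is their sum. The sketch below computes them from raw amounts. Mapping `paid_in` to called capital, `distributed` to distributed amount, and `residual_value` to the current valuation is an assumption for illustration; Nexus extracts these metrics from the statement rather than necessarily recomputing them. XIRR needs dated cash flows and is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FundMultiples:
    dpi: float   # Distributions to Paid-In
    rvpi: float  # Residual Value to Paid-In
    tvpi: float  # Total Value to Paid-In


def fund_multiples(paid_in: float, distributed: float, residual_value: float) -> FundMultiples:
    """Compute DPI, RVPI, and TVPI from capital amounts in the same currency."""
    if paid_in <= 0:
        raise ValueError("paid_in must be positive")
    dpi = distributed / paid_in
    rvpi = residual_value / paid_in
    return FundMultiples(dpi=dpi, rvpi=rvpi, tvpi=dpi + rvpi)


print(fund_multiples(paid_in=100.0, distributed=25.0, residual_value=125.0))
# FundMultiples(dpi=0.25, rvpi=1.25, tvpi=1.5)
```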

Format Normalization

Nexus automatically normalizes different AIF formats to a unified structure for consistent rendering:

Field Name Mapping:

account_info → account_statement
holding_summary (singular) → holdings_summary (plural)
capital_commitment → commitment_amount
capital_contribution → contribution_received_amount
units_allocated → units_allotted
unit_class → class_name
held → is_held

Structural Transformations:

Individual holder objects (holder_1, holder_2, holder_3) → investor_details with typed fields
Individual nominee objects (nominee_1, nominee_2, nominee_3) → nominees array
Nested transaction structures → flattened transaction arrays

This normalization layer allows new AIF formats to be added without breaking existing UI components.
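A normalization pass of this kind can be sketched as a recursive key rename plus a fold of the numbered nominee objects into a list. The rename table is taken from the mapping above; applying it at every nesting depth, the function name, and the omission of the holder and transaction transformations are simplifications made for this example.

```python
from typing import Any, Dict

# Field renames from the mapping above (source name -> normalized name).
FIELD_RENAMES = {
    "account_info": "account_statement",
    "holding_summary": "holdings_summary",
    "capital_commitment": "commitment_amount",
    "capital_contribution": "contribution_received_amount",
    "units_allocated": "units_allotted",
    "unit_class": "class_name",
    "held": "is_held",
}

NOMINEE_KEYS = ("nominee_1", "nominee_2", "nominee_3")


def normalize(node: Any) -> Any:
    """Recursively rename keys and fold nominee_1..3 into a nominees list."""
    if isinstance(node, list):
        return [normalize(item) for item in node]
    if not isinstance(node, dict):
        return node
    out: Dict[str, Any] = {}
    nominees = []
    for key, value in node.items():
        if key in NOMINEE_KEYS:
            nominees.append((key, normalize(value)))
        else:
            out[FIELD_RENAMES.get(key, key)] = normalize(value)
    if nominees:
        # Order by the numbered key so nominee_1 always comes first.
        out["nominees"] = [value for _, value in sorted(nominees, key=lambda kv: kv[0])]
    return out
```

Because unknown keys pass through unchanged, a new AMC format only needs additional entries in the rename table rather than changes to the consumers of the normalized structure.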

Export System

Nexus supports two export modes, each producing a downloadable dataset with selectable fields.

Export Flow

The diagram below shows the two export paths. Both start from a completed extraction, but they produce different output granularities — portfolio-level holdings vs. the underlying constituent securities within each holding.

flowchart LR
    subgraph Extraction["Completed Extraction"]
        EX["Structured Data<br/>from Pipeline"]
    end

    EX --> FieldsAPI["GET /export/{job_id}/fields<br/>Fetch available fields + coverage"]

    FieldsAPI --> HoldingsPath["Holdings Export"]
    FieldsAPI --> UnderlyingPath["Underlying Export"]

    subgraph HoldingsPath["Holdings Export"]
        H1["Select fields:<br/>scheme, ISIN, units,<br/>NAV, market_value"]
        H1 --> H2["Download<br/>1 row per holding"]
    end

    subgraph UnderlyingPath["Underlying Export"]
        U1["Select fields:<br/>parent_scheme, sector,<br/>allocation_pct"]
        U1 --> U2["Download<br/>1 row per constituent"]
    end

    classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
    classDef secondary fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b
    classDef highlight fill:#fae8ff,stroke:#a855f7,color:#1e293b

    class EX primary
    class FieldsAPI secondary
    class H1,H2 success
    class U1,U2 highlight

Holdings Export

Exports portfolio-level holdings data. Each row is one security holding with its valuation.

Typical fields: scheme_name, isin, folio_number, units, nav, market_value, invested_value, gain_loss, gain_loss_pct, amc

Underlying Export

Exports the underlying securities within each holding. For fund-of-funds or multi-asset products, this breaks down the constituent securities.

Typical fields: parent_scheme, underlying_security, sector, allocation_pct, market_value

Field Selection

The export UI presents all available fields with coverage percentages. Coverage indicates what percentage of extracted records have a non-null value for that field. For example:

scheme_name — 100% coverage
isin — 98% coverage
nav — 95% coverage
gain_loss_pct — 82% coverage

Users select which fields to include before downloading.

Endpoint: GET /v1/nexus/export/{job_id}/fields returns available fields and coverage. The actual export download uses the selected field list as query parameters.
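Coverage as defined above (the share of records with a non-null value for a field) is straightforward to compute. The sketch below is illustrative only; the function name and the rounding to one decimal place are assumptions, and the backend may compute coverage differently.

```python
from typing import Any, Dict, List


def field_coverage(records: List[Dict[str, Any]]) -> Dict[str, float]:
    """Percentage of records with a non-null value, per field."""
    if not records:
        return {}
    fields = {name for record in records for name in record}
    return {
        name: round(100.0 * sum(r.get(name) is not None for r in records) / len(records), 1)
        for name in sorted(fields)
    }


rows = [
    {"scheme_name": "Fund A", "isin": "X1"},
    {"scheme_name": "Fund B", "isin": None},
    {"scheme_name": "Fund C", "isin": "X3"},
    {"scheme_name": "Fund D", "isin": "X4"},
]
print(field_coverage(rows))  # {'isin': 75.0, 'scheme_name': 100.0}
```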

Token Usage and Costs

Nexus tracks LLM token usage at every pipeline stage. This gives full transparency into AI processing costs per document.

Each stage records:

Metric	Description
`input_tokens`	Number of tokens sent to the model
`output_tokens`	Number of tokens generated by the model
`model`	Model identifier used for that stage (e.g., GPT-4o, Claude)
`cost_usd`	Cost in US dollars for that stage
`cost_inr`	Cost in Indian Rupees for that stage

The Costs tab on the document detail page displays:

Per-stage token breakdown in a table
Total input and output tokens across all stages
Aggregate cost in both USD and INR
Model used at each stage (different stages may use different models for cost optimization)
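The totals shown on the Costs tab amount to summing the per-stage records. A minimal sketch, assuming each stage record is a dictionary with the metric names from the table; the model names and figures in the example are made up.

```python
from typing import Dict, List, Union

StageUsage = Dict[str, Union[str, int, float]]


def total_usage(stages: List[StageUsage]) -> Dict[str, float]:
    """Sum tokens and costs across all pipeline stages."""
    totals: Dict[str, float] = {
        "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0, "cost_inr": 0.0,
    }
    for stage in stages:
        for key in totals:
            totals[key] += stage[key]
    return totals


usage = [
    {"model": "model-a", "input_tokens": 1200, "output_tokens": 300, "cost_usd": 0.25, "cost_inr": 20.0},
    {"model": "model-b", "input_tokens": 800, "output_tokens": 200, "cost_usd": 0.5, "cost_inr": 40.0},
]
print(total_usage(usage))
# {'input_tokens': 2000, 'output_tokens': 500, 'cost_usd': 0.75, 'cost_inr': 60.0}
```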

Document Detail Page

After processing, each document has a dedicated detail page with five tabs providing complete visibility into the extraction.

Tab 1: Pipeline

An animated vertical timeline showing all 10 stages. Each stage displays:

Stage name and description
Status indicator (completed, in-progress, pending, failed)
Duration taken
Real-time animation during active processing

The timeline updates live while the pipeline is running, giving users immediate feedback on progress.

Tab 2: Data

The core extraction view. Presents structured data in interactive tables:

Securities table — Sortable, searchable list of all extracted holdings. Click any security row to expand and see full details including ISIN, folio, invested value, and gain/loss.
Summary cards — Total portfolio value, total invested, overall gain/loss, number of securities found.
Asset allocation chart — Visual breakdown by asset class.

Tab 3: Insights

AI-generated analysis of the extracted document:

Document summary — Natural language overview of what was found
Entity details — All detected entities (AMCs, account holders, folios) with their locations in the document
Data quality notes — Any validation warnings or low-confidence extractions flagged during validate_insights

Tab 4: Costs

Token usage and cost breakdown (see the Token Usage and Costs section above). Displays a table with one row per pipeline stage showing input tokens, output tokens, model, and cost.

Tab 5: Export

Field selection interface for downloading extracted data:

Toggle between Holdings and Underlying export types
Checkbox list of all available fields with coverage percentages
Select All / Deselect All controls
Download button generates the export file

Frontend Routes

Route	Description
`/nexus`	Overview page. Shows upload action and recent document processing activity.
`/nexus/upload`	Upload interface. Drag-and-drop or file picker for PDF documents. Triggers pipeline on upload.
`/nexus/documents`	Document history list. Lists all previously processed documents with status, date, and document type. Searchable and sortable. The Type column displays the AI-classified document type (e.g. Factsheet, Holdings, Snapshot) sourced from `document_type`. The list auto-refreshes every 30 seconds, as directed by the `X-Poll-Interval` response header.
`/nexus/documents/[docId]`	Document detail page with the 5-tab layout (Pipeline, Data, Insights, Costs, Export).

API Endpoints Reference

Method	Endpoint	Description
`POST`	`/api/v1/nexus/upload/`	Upload a PDF document (multipart form data)
`POST`	`/api/v2/nexus/pipeline/process`	Start the 10-stage pipeline. Returns `job_id` and `process_id`
`GET`	`/api/v2/nexus/status/process/{process_id}/progress`	Poll pipeline progress. Returns stage name, percentage, per-stage status
`GET`	`/api/v1/documents/me`	List documents belonging to the authenticated user. The frontend polls this endpoint every 30 seconds, as directed by the `X-Poll-Interval: 30` response header.
`GET`	`/api/v1/nexus/documents/{doc_id}/extraction`	Fetch full extraction results for a completed document
`GET`	`/api/v1/nexus/export/{job_id}/fields`	Get available export fields with coverage percentages
`GET`	`/api/v1/nexus/doc-fetch/{job_id}?type=ORIGINAL`	Download the original uploaded PDF
`GET`	`/api/v1/nexus/stats`	Admin-only aggregate statistics across all processed documents

All endpoints route through the Studio BFF gateway (studio-backend-dev.centricitywealth.tech). The frontend never calls the Nexus backend directly.

User-scoped document listing: As of v6.2, getDocuments() calls GET /documents/me instead of GET /documents/. This returns only the authenticated user’s documents rather than the full collection. Query parameters (limit, skip, product_type) are forwarded as a URL query string.
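Forwarding the query parameters as a URL query string can be illustrated as follows. The actual getDocuments() lives in the frontend; this Python sketch only shows the idea, and the helper name, the `base` argument, and the choice to drop unset parameters are assumptions.

```python
from typing import Optional
from urllib.parse import urlencode


def documents_me_url(base: str, limit: Optional[int] = None,
                     skip: Optional[int] = None,
                     product_type: Optional[str] = None) -> str:
    """Build the /documents/me URL, including only the parameters that were set."""
    params = {"limit": limit, "skip": skip, "product_type": product_type}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    path = f"{base.rstrip('/')}/documents/me"
    return f"{path}?{query}" if query else path


print(documents_me_url("/api/v1", limit=20, skip=40, product_type="Mutual Fund"))
# /api/v1/documents/me?limit=20&skip=40&product_type=Mutual+Fund
```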

Document type column: As of v6.3, the Type column in the documents list displays document_type (the AI-classified document category such as factsheet, holdingstatement, portfoliosnapshot) instead of product_type. Each type renders with a distinct colour badge. If document_type is absent in the API response, the column shows -.