Nexus — Document Intelligence

Nexus is Sentinel’s AI-powered document intelligence engine. It transforms unstructured financial documents — Consolidated Account Statements (CAS), portfolio statements, account summaries, and other wealth management documents — into structured, queryable data through a fully automated 10-stage pipeline.

Upload a PDF. Nexus reads every page, classifies the document type, extracts layout and hierarchy, identifies financial entities, pulls out securities holdings with ISINs, NAVs, and market values, cross-validates the results, and delivers clean structured data ready for export. No manual data entry. No spreadsheet wrangling.


Pipeline Flow

The Nexus pipeline follows a linear request-response pattern. Upload a document, start processing, poll for progress, then fetch the structured extraction.

graph LR
    Upload["Upload PDF<br/>POST /nexus/upload/"] --> Process["Start Pipeline<br/>POST /v2/nexus/pipeline/process"]
    Process --> Poll["Poll Progress<br/>GET /v2/nexus/status/process/{id}/progress"]
    Poll -->|"100%"| Extract["Fetch Extraction<br/>GET /v1/nexus/documents/{id}/extraction"]
    Extract --> Export["Export Data<br/>GET /v1/nexus/export/{job_id}/fields"]

    classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
    classDef secondary fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b

    class Upload primary
    class Process,Poll secondary
    class Extract,Export success

Step-by-step:

  1. Upload — User selects a PDF file. The frontend sends it as multipart form data to the upload endpoint. The backend returns a job_id.
  2. Process — The frontend triggers the V2 pipeline with the job_id. The backend returns both a job_id and a process_id (V2-specific). The process_id is stored in the frontend’s nexus store for progress tracking.
  3. Poll — The frontend polls the progress endpoint using the process_id at regular intervals. Each response includes the current stage name, percentage complete, and per-stage status.
  4. Extract — Once progress reaches 100%, the frontend fetches the full extraction payload. This contains all parsed securities, asset allocation, account metadata, and document classification.
  5. Export — Users can export the structured data as holdings or underlying securities with selectable fields.
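The process, poll, and extract steps above can be sketched as a small client. This is only an illustration under stated assumptions, not the frontend's actual code: the `request` transport callable, the `percentage` response key, and passing the document id in as `doc_id` are all hypothetical. Only the endpoint paths are taken from this page.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport: (method, path, payload) -> decoded JSON response.
Request = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def wait_for_completion(request: Request, process_id: str,
                        interval_s: float = 2.0, max_polls: int = 300,
                        sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll the progress endpoint until it reports 100%."""
    path = f"/v2/nexus/status/process/{process_id}/progress"
    for _ in range(max_polls):
        progress = request("GET", path, {})
        # "percentage" is an assumed key name for the percent-complete value.
        if progress.get("percentage", 0) >= 100:
            return progress
        sleep(interval_s)
    raise TimeoutError(f"process {process_id} did not finish in {max_polls} polls")


def run_pipeline(request: Request, job_id: str, doc_id: str,
                 **poll_kwargs: Any) -> Dict[str, Any]:
    """Start processing for an uploaded job, wait, then fetch the extraction."""
    started = request("POST", "/v2/nexus/pipeline/process", {"job_id": job_id})
    wait_for_completion(request, started["process_id"], **poll_kwargs)
    return request("GET", f"/v1/nexus/documents/{doc_id}/extraction", {})
```

Injecting `request` and `sleep` keeps the loop testable without a network, and `max_polls` bounds how long a stuck pipeline can keep the client waiting.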

Document Lifecycle

Documents transition through a series of states from initial upload to final export. The state diagram below shows every possible transition, including error recovery and manual review paths.

stateDiagram-v2
    [*] --> Uploaded: POST /nexus/upload
    Uploaded --> Processing: POST /v2/nexus/pipeline/process
    Processing --> Completed: All 10 stages pass
    Processing --> Failed: Stage error
    Completed --> Exported: Export to Excel/JSON
    Failed --> Processing: Retry
    Completed --> NeedsReview: Low confidence
    NeedsReview --> Completed: Manual review

A document enters NeedsReview when the validate_insights stage produces confidence scores below the threshold. After a user reviews and confirms the extraction, it moves back to Completed and becomes eligible for export.
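The transitions in the state diagram can be expressed as a lookup table. The sketch below is illustrative only: the `DocState` enum and `transition` helper are hypothetical names, and the backend may model states differently.

```python
from enum import Enum


class DocState(Enum):
    UPLOADED = "Uploaded"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"
    NEEDS_REVIEW = "NeedsReview"
    EXPORTED = "Exported"


# Allowed transitions, copied from the state diagram above.
TRANSITIONS = {
    DocState.UPLOADED: {DocState.PROCESSING},
    DocState.PROCESSING: {DocState.COMPLETED, DocState.FAILED},
    DocState.COMPLETED: {DocState.EXPORTED, DocState.NEEDS_REVIEW},
    DocState.FAILED: {DocState.PROCESSING},        # retry
    DocState.NEEDS_REVIEW: {DocState.COMPLETED},   # manual review
    DocState.EXPORTED: set(),
}


def transition(current: DocState, target: DocState) -> DocState:
    """Return the target state, or raise if the diagram does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

For example, `transition(DocState.FAILED, DocState.PROCESSING)` models a retry, while `transition(DocState.NEEDS_REVIEW, DocState.EXPORTED)` raises because a reviewed document must return to Completed first.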


The 10 Pipeline Stages

Every document passes through all 10 stages sequentially. The frontend displays an animated timeline showing real-time progress through each stage.

| # | Stage | Description |
| --- | --- | --- |
| 1 | load_split | Load the uploaded PDF and split it into individual pages. Each page is prepared for independent analysis. |
| 2 | classify | Determine the document type: CAS (Consolidated Account Statement), account statement, portfolio report, capital gains statement, etc. Classification drives downstream extraction logic. |
| 3 | layout_extract | Extract the visual layout structure from each page. Identifies tables, headers, paragraphs, and spatial relationships between elements. |
| 4 | document_structure | Parse the document’s logical hierarchy. Maps sections, subsections, and content blocks into a structured tree. |
| 5 | entity_detect | Identify financial entities within the document: fund names, AMC (Asset Management Company) names, ISIN codes, folio numbers, account identifiers. |
| 6 | metadata_extract | Extract account-level metadata: account holder name, PAN, AMC details, statement dates, folio numbers, and contact information. |
| 7 | info_extract | The core extraction stage. Pulls out individual securities with their holdings data: scheme names, ISIN codes, units held, NAV (Net Asset Value), market value, invested value, and gain/loss figures. |
| 8 | validate_insights | Cross-validate extracted data for consistency. Checks that unit counts multiplied by NAV match market values. Flags discrepancies and generates confidence scores. |
| 9 | store | Persist the validated extraction results to the backend database. The document is now queryable and available for export. |
| 10 | complete | Pipeline finished. All stages have completed successfully. The document detail page is now fully populated. |
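Because the stages run strictly in order, a per-stage status map for the timeline can be derived from the name of the current stage alone. The sketch below assumes that behaviour: the `stage_statuses` helper is hypothetical, and the status strings are borrowed from the Pipeline tab description (completed, in-progress, pending, failed).

```python
from typing import Dict

# Stage names in execution order, as listed in the table above.
STAGES = (
    "load_split", "classify", "layout_extract", "document_structure",
    "entity_detect", "metadata_extract", "info_extract",
    "validate_insights", "store", "complete",
)


def stage_statuses(current_stage: str, failed: bool = False) -> Dict[str, str]:
    """Mark earlier stages completed, the current one active or failed, the rest pending."""
    current_index = STAGES.index(current_stage)  # raises ValueError for unknown names
    statuses = {}
    for index, name in enumerate(STAGES):
        if index < current_index:
            statuses[name] = "completed"
        elif index == current_index:
            statuses[name] = "failed" if failed else "in-progress"
        else:
            statuses[name] = "pending"
    return statuses
```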

Pipeline Stage Detail

The following diagram visualizes the data each stage produces as it flows through the pipeline. Early stages analyze structure; middle stages extract financial entities and holdings; final stages validate, persist, and mark completion.

graph TD
    S1["1. Load Split<br/>Pages separated"] --> S2["2. Classify<br/>Document type identified"]
    S2 --> S3["3. Layout Extract<br/>Visual structure mapped"]
    S3 --> S4["4. Document Structure<br/>Hierarchy detected"]
    S4 --> S5["5. Entity Detect<br/>Names, PANs, accounts"]
    S5 --> S6["6. Metadata Extract<br/>AMC, folio, dates"]
    S6 --> S7["7. Info Extract<br/>Securities, NAVs, values"]
    S7 --> S8["8. Validate Insights<br/>Cross-checks applied"]
    S8 --> S9["9. Store<br/>Data persisted"]
    S9 --> S10["10. Complete<br/>Ready for export"]

    classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
    classDef highlight fill:#fae8ff,stroke:#a855f7,color:#1e293b
    classDef warning fill:#fef3c7,stroke:#f59e0b,color:#1e293b
    classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b

    class S1,S2,S3,S4 primary
    class S5,S6,S7 highlight
    class S8 warning
    class S9,S10 success

Stages 1-4 (blue) handle document ingestion and structural analysis. Stages 5-7 (purple) perform entity and financial data extraction. Stage 8 (amber) applies validation rules. Stages 9-10 (green) persist results and finalize the pipeline.


Extraction Data

Once the pipeline completes, Nexus delivers structured data across four categories.

Securities Holdings

Each security extracted from the document includes:

| Field | Description | Example |
| --- | --- | --- |
| scheme_name | Full name of the mutual fund or security | SBI Bluechip Fund - Direct Plan - Growth |
| isin | International Securities Identification Number | INF200K01RJ1 |
| units | Number of units held | 245.678 |
| nav | Net Asset Value per unit | 78.45 |
| market_value | Current market value (units x NAV) | 19,273.44 |
| invested_value | Total amount originally invested | 15,000.00 |
| gain_loss | Absolute gain or loss (market_value - invested_value) | 4,273.44 |
| gain_loss_pct | Percentage return (gain_loss / invested_value) | 28.49% |
| folio_number | Folio number for the holding | 1234567890 |
| amc | Asset Management Company | SBI Funds Management |
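The relationships between these fields are what the validate_insights stage cross-checks. A minimal sketch of such a check follows, assuming a fixed absolute tolerance; the function names, the tolerance value, and the sample numbers are illustrative and do not describe the actual validation rules.

```python
from decimal import Decimal
from typing import Tuple


def market_value_matches(units: str, nav: str, market_value: str,
                         tolerance: str = "0.01") -> bool:
    """Check that units x NAV agrees with the reported market value."""
    expected = Decimal(units) * Decimal(nav)
    return abs(expected - Decimal(market_value)) <= Decimal(tolerance)


def gain_loss(market_value: str, invested_value: str) -> Tuple[Decimal, Decimal]:
    """Return (absolute gain, percentage return rounded to 2 decimals)."""
    value, invested = Decimal(market_value), Decimal(invested_value)
    if invested == 0:
        raise ValueError("invested_value must be non-zero")
    gain = value - invested
    pct = (gain / invested * 100).quantize(Decimal("0.01"))
    return gain, pct
```

With hypothetical inputs of 100 units at a NAV of 12.50, `market_value_matches("100", "12.50", "1250.00")` passes, and `gain_loss("1250.00", "1000.00")` yields a gain of 250.00 and a return of 25.00%. `Decimal` is used instead of floats so that rounding noise does not trigger false discrepancies.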

Asset Allocation

Sector-level breakdown of the portfolio:

  • Equity percentage
  • Debt percentage
  • Hybrid/balanced percentage
  • Sector-wise allocation within equity (Large Cap, Mid Cap, Small Cap, Multi Cap)

Account Metadata

Document-level information:

  • account_holder_name — Name of the investor
  • pan — PAN card number (masked in UI)
  • amc — Asset Management Company name
  • folio_numbers — All folio numbers found in the document
  • statement_period — Date range the statement covers
  • statement_date — Date the statement was generated

Document Classification

  • product_type — Broad category (Mutual Fund, Insurance, PMS, AIF)
  • document_subtype — Specific type (CAS, Account Statement, Capital Gains Statement, Transaction Statement)
  • confidence_score — Classification confidence (0.0 to 1.0)

AIF Extraction Support

Nexus provides comprehensive extraction support for Alternative Investment Fund (AIF) documents with flexible schema handling to accommodate multiple fund house formats.

Supported AIF Formats

AIF statements vary widely in structure across different Asset Management Companies (AMCs). Nexus handles 5+ format variations through a format-agnostic architecture:

| Format | Description | Example AMCs |
| --- | --- | --- |
| AIFExtraction | Portfolio-level extraction with portfolio companies, NAV, and performance metrics (TVPI, DPI, RVPI, MOIC, XIRR) | Generic AIF funds |
| AIFStatementExtraction | New NEXUS format with normalized account info, holdings summary, and transaction history. Distinguished by the transaction_count field | Motilal Oswal, Axis (new format) |
| AIFAccountStatementExtraction | Flexible wrapper format with an AccountStatement key containing AMC-specific fields | Axis RERA, Motilal Oswal (legacy) |
| AIFFlatStatementExtraction | Flat snake_case format without wrappers. Covers Format C (account_holder), Format D (personal_information), and future variants | Various AMCs |
| NexusAifSoaExtraction | Generic AIF Statement of Account (SOA) format with standardized investor details and holdings summary | Generic SOA documents |
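The distinguishing keys named in the table suggest how a payload could be routed to a format. The sketch below is an assumption-heavy illustration: the `detect_aif_format` helper and the order of the checks are hypothetical, and only the key names `transaction_count`, `AccountStatement`, `account_holder`, and `personal_information` come from this page. AIFExtraction and NexusAifSoaExtraction are not detected here because the page does not name their distinguishing keys.

```python
from typing import Any, Dict


def detect_aif_format(payload: Dict[str, Any]) -> str:
    """Guess the AIF format from top-level keys; returns "unknown" when unsure."""
    if "transaction_count" in payload:
        return "AIFStatementExtraction"
    if "AccountStatement" in payload:
        return "AIFAccountStatementExtraction"
    # Format C uses account_holder, Format D uses personal_information.
    if "account_holder" in payload or "personal_information" in payload:
        return "AIFFlatStatementExtraction"
    return "unknown"
```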

Portfolio Companies Data

For AIF documents with underlying portfolio companies, Nexus extracts:

| Field | Description | Example |
| --- | --- | --- |
| name | Portfolio company name | TechCorp Private Ltd |
| sector | Business sector | Technology |
| investment_date | Initial investment date | 2023-06-15 |
| invested_amount | Total capital invested | 50,00,000 |
| current_value | Current NAV of investment | 72,00,000 |
| moic | Multiple on Invested Capital | 1.44x |
| irr | Internal Rate of Return | 18.5% |
| status | Investment status | Active / Exited / Partial Exit |

The Portfolio Companies Table in the UI provides:

  • Sortable columns (by name, sector, invested amount, current value, MOIC, IRR)
  • Search functionality across company names and sectors
  • Aggregated portfolio statistics (total companies, total invested, total NAV)

AIF Transaction History

Transaction-level data extracted from AIF statements:

| Field | Description |
| --- | --- |
| date | Transaction date |
| description | Transaction description |
| transaction_type | Capital call / Distribution / NAV update |
| units | Units affected |
| nav | NAV per unit at transaction |
| amount | Transaction amount |
| charges | Associated fees/charges |

AIF Account Information

Investor details extracted from AIF statements:

  • Account Statement: Folio number, scheme name, statement period, PAN, advisor details
  • Investor Details: Primary holder, joint holders (1-3), guardian, nominees (with allocation percentages)
  • Holdings Summary: Total commitment, capital contribution, uncalled commitment, units allotted, current NAV, valuation
  • Capital Summary: Total commitment, called capital, uncalled capital, distributed amount
  • Performance Metrics: TVPI (Total Value to Paid-In), DPI (Distributions to Paid-In), RVPI (Residual Value to Paid-In), MOIC, XIRR
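The multiples listed under Performance Metrics follow standard private-fund definitions: DPI is distributions over paid-in capital, RVPI is residual value over paid-in capital, and TVPI is their sum. The sketch below shows that arithmetic. The `CapitalSummary` class and its `residual_value` field are hypothetical names, and Nexus extracts these figures from statements rather than necessarily computing them this way.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CapitalSummary:
    called_capital: float      # paid-in capital
    distributed_amount: float  # cash returned to the investor
    residual_value: float      # current value of what is still held (assumed field)


def fund_multiples(summary: CapitalSummary) -> Dict[str, float]:
    """Compute DPI, RVPI, and TVPI from a capital summary."""
    if summary.called_capital <= 0:
        raise ValueError("called_capital must be positive")
    dpi = summary.distributed_amount / summary.called_capital
    rvpi = summary.residual_value / summary.called_capital
    return {"dpi": dpi, "rvpi": rvpi, "tvpi": dpi + rvpi}


def moic(current_value: float, invested_amount: float) -> float:
    """Multiple on invested capital for a single portfolio company."""
    if invested_amount <= 0:
        raise ValueError("invested_amount must be positive")
    return current_value / invested_amount
```

Using the portfolio company example earlier on this page, `moic(7_200_000, 5_000_000)` gives 1.44, matching the 1.44x shown there.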

Format Normalization

Nexus automatically normalizes different AIF formats to a unified structure for consistent rendering:

Field Name Mapping:

  • account_info → account_statement
  • holding_summary (singular) → holdings_summary (plural)
  • capital_commitment → commitment_amount
  • capital_contribution → contribution_received_amount
  • units_allocated → units_allotted
  • unit_class → class_name
  • held → is_held

Structural Transformations:

  • Individual holder objects (holder_1, holder_2, holder_3) → investor_details with typed fields
  • Individual nominee objects (nominee_1, nominee_2, nominee_3) → nominees array
  • Nested transaction structures → flattened transaction arrays

This normalization layer allows new AIF formats to be added without breaking existing UI components.
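A normalization pass of this kind can be sketched as a recursive key rename plus a step that collects numbered nominee objects into an array. This is an illustration, not the actual Nexus normalizer: the `normalize` function is hypothetical, it applies renames at every nesting level (the real layer may be more selective), and the holder and transaction transformations are omitted for brevity.

```python
import re
from typing import Any

# Source key -> unified key, following the field name mapping above.
FIELD_RENAMES = {
    "account_info": "account_statement",
    "holding_summary": "holdings_summary",
    "capital_commitment": "commitment_amount",
    "capital_contribution": "contribution_received_amount",
    "units_allocated": "units_allotted",
    "unit_class": "class_name",
    "held": "is_held",
}

_NOMINEE_KEY = re.compile(r"^nominee_(\d+)$")


def normalize(value: Any) -> Any:
    """Rename keys recursively and fold nominee_N objects into a nominees list."""
    if isinstance(value, list):
        return [normalize(item) for item in value]
    if not isinstance(value, dict):
        return value
    out, nominees = {}, []
    for key, child in value.items():
        match = _NOMINEE_KEY.match(key)
        if match:
            nominees.append((int(match.group(1)), normalize(child)))
        else:
            out[FIELD_RENAMES.get(key, key)] = normalize(child)
    if nominees:
        # Order by the numeric suffix so nominee_1 comes first.
        out["nominees"] = [n for _, n in sorted(nominees, key=lambda pair: pair[0])]
    return out
```

Keeping the renames in a single dictionary means a new format variant usually needs only new entries rather than new code paths.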


Export System

Nexus supports two export modes, each producing a downloadable dataset with selectable fields.

Export Flow

The diagram below shows the two export paths. Both start from a completed extraction, but they produce different output granularities — portfolio-level holdings vs. the underlying constituent securities within each holding.

flowchart LR
    subgraph Extraction["Completed Extraction"]
        EX["Structured Data<br/>from Pipeline"]
    end

    EX --> FieldsAPI["GET /export/{job_id}/fields<br/>Fetch available fields + coverage"]

    FieldsAPI --> HoldingsPath["Holdings Export"]
    FieldsAPI --> UnderlyingPath["Underlying Export"]

    subgraph HoldingsPath["Holdings Export"]
        H1["Select fields:<br/>scheme, ISIN, units,<br/>NAV, market_value"]
        H1 --> H2["Download<br/>1 row per holding"]
    end

    subgraph UnderlyingPath["Underlying Export"]
        U1["Select fields:<br/>parent_scheme, sector,<br/>allocation_pct"]
        U1 --> U2["Download<br/>1 row per constituent"]
    end

    classDef primary fill:#dbeafe,stroke:#3b82f6,color:#1e293b
    classDef secondary fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    classDef success fill:#d1fae5,stroke:#10b981,color:#1e293b
    classDef highlight fill:#fae8ff,stroke:#a855f7,color:#1e293b

    class EX primary
    class FieldsAPI secondary
    class H1,H2 success
    class U1,U2 highlight

Holdings Export

Exports portfolio-level holdings data. Each row is one security holding with its valuation.

Typical fields: scheme_name, isin, folio_number, units, nav, market_value, invested_value, gain_loss, gain_loss_pct, amc

Underlying Export

Exports the underlying securities within each holding. For fund-of-funds or multi-asset products, this breaks down the constituent securities.

Typical fields: parent_scheme, underlying_security, sector, allocation_pct, market_value

Field Selection

The export UI presents all available fields with coverage percentages. Coverage indicates what percentage of extracted records have a non-null value for that field. For example:

  • scheme_name — 100% coverage
  • isin — 98% coverage
  • nav — 95% coverage
  • gain_loss_pct — 82% coverage

Users select which fields to include before downloading.

Endpoint: GET /v1/nexus/export/{job_id}/fields returns available fields and coverage. The actual export download uses the selected field list as query parameters.
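Coverage as defined above (the share of records with a non-null value for a field) can be computed in a few lines. The sketch below is illustrative: `field_coverage` and `project` are hypothetical helpers, and rounding to one decimal place is an assumption.

```python
from typing import Any, Dict, Iterable, List


def field_coverage(records: List[Dict[str, Any]],
                   fields: Iterable[str]) -> Dict[str, float]:
    """Percentage of records with a non-null value, per field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: round(100 * sum(1 for r in records if r.get(field) is not None) / total, 1)
        for field in fields
    }


def project(records: List[Dict[str, Any]],
            selected: List[str]) -> List[Dict[str, Any]]:
    """Keep only the user-selected fields, in the selected order."""
    return [{field: record.get(field) for field in selected} for record in records]
```

Missing keys and explicit nulls are treated the same way, so a field that an extractor never emitted counts against coverage just like one it emitted as null.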


Token Usage and Costs

Nexus tracks LLM token usage at every pipeline stage. This gives full transparency into AI processing costs per document.

Each stage records:

| Metric | Description |
| --- | --- |
| input_tokens | Number of tokens sent to the model |
| output_tokens | Number of tokens generated by the model |
| model | Model identifier used for that stage (e.g., GPT-4o, Claude) |
| cost_usd | Cost in US dollars for that stage |
| cost_inr | Cost in Indian Rupees for that stage |

The Costs tab on the document detail page displays:

  • Per-stage token breakdown in a table
  • Total input and output tokens across all stages
  • Aggregate cost in both USD and INR
  • Model used at each stage (different stages may use different models for cost optimization)
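The totals shown on the Costs tab amount to a sum over the per-stage records. A minimal sketch follows, assuming each stage record carries the metrics listed above; the `StageUsage` class, the `summarize_usage` function, and the rounding precision are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class StageUsage:
    stage: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    cost_inr: float


def summarize_usage(usages: List[StageUsage]) -> Dict[str, Any]:
    """Aggregate token counts and costs across all pipeline stages."""
    return {
        "input_tokens": sum(u.input_tokens for u in usages),
        "output_tokens": sum(u.output_tokens for u in usages),
        "cost_usd": round(sum(u.cost_usd for u in usages), 6),
        "cost_inr": round(sum(u.cost_inr for u in usages), 2),
        # Distinct models, since different stages may use different ones.
        "models": sorted({u.model for u in usages}),
    }
```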

Document Detail Page

After processing, each document has a dedicated detail page with five tabs providing complete visibility into the extraction.

Tab 1: Pipeline

An animated vertical timeline showing all 10 stages. Each stage displays:

  • Stage name and description
  • Status indicator (completed, in-progress, pending, failed)
  • Duration taken
  • Real-time animation during active processing

The timeline updates live while the pipeline is running, giving users immediate feedback on progress.

Tab 2: Data

The core extraction view. Presents structured data in interactive tables:

  • Securities table — Sortable, searchable list of all extracted holdings. Click any security row to expand and see full details including ISIN, folio, invested value, and gain/loss.
  • Summary cards — Total portfolio value, total invested, overall gain/loss, number of securities found.
  • Asset allocation chart — Visual breakdown by asset class.

Tab 3: Insights

AI-generated analysis of the extracted document:

  • Document summary — Natural language overview of what was found
  • Entity details — All detected entities (AMCs, account holders, folios) with their locations in the document
  • Data quality notes — Any validation warnings or low-confidence extractions flagged during validate_insights

Tab 4: Costs

Token usage and cost breakdown (see Token Usage section above). Displays a table with one row per pipeline stage showing input tokens, output tokens, model, and cost.

Tab 5: Export

Field selection interface for downloading extracted data:

  • Toggle between Holdings and Underlying export types
  • Checkbox list of all available fields with coverage percentages
  • Select All / Deselect All controls
  • Download button generates the export file

Frontend Routes

| Route | Description |
| --- | --- |
| /nexus | Overview page. Shows upload action and recent document processing activity. |
| /nexus/upload | Upload interface. Drag-and-drop or file picker for PDF documents. Triggers pipeline on upload. |
| /nexus/documents | Document history list. All previously processed documents with status, date, and document type. Searchable and sortable. The Type column displays the AI-classified document type (e.g. Factsheet, Holdings, Snapshot) sourced from document_type. The list auto-refreshes every 30 seconds via the X-Poll-Interval header. |
| /nexus/documents/[docId] | Document detail page with the 5-tab layout (Pipeline, Data, Insights, Costs, Export). |

API Endpoints Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/v1/nexus/upload/ | Upload a PDF document (multipart form data) |
| POST | /api/v2/nexus/pipeline/process | Start the 10-stage pipeline. Returns job_id and process_id |
| GET | /api/v2/nexus/status/process/{process_id}/progress | Poll pipeline progress. Returns stage name, percentage, per-stage status |
| GET | /api/v1/documents/me | List documents belonging to the authenticated user. Polled every 30 s via X-Poll-Interval: 30 response header. |
| GET | /api/v1/nexus/documents/{doc_id}/extraction | Fetch full extraction results for a completed document |
| GET | /api/v1/nexus/export/{job_id}/fields | Get available export fields with coverage percentages |
| GET | /api/v1/nexus/doc-fetch/{job_id}?type=ORIGINAL | Download the original uploaded PDF |
| GET | /api/v1/nexus/stats | Admin-only aggregate statistics across all processed documents |

All endpoints route through the Studio BFF gateway (studio-backend-dev.centricitywealth.tech). The frontend never calls the Nexus backend directly.

User-scoped document listing: As of v6.2, getDocuments() calls GET /documents/me instead of GET /documents/. This returns only the authenticated user’s documents rather than the full collection. Query parameters (limit, skip, product_type) are forwarded as a URL query string.
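Forwarding the query parameters as a URL query string can be sketched as below. The `documents_me_url` helper and the `base` prefix argument are hypothetical; only the `/documents/me` path and the parameter names (limit, skip, product_type) come from this page. The real `getDocuments()` lives in the frontend and may build the URL differently.

```python
from typing import Optional
from urllib.parse import urlencode


def documents_me_url(base: str, limit: Optional[int] = None,
                     skip: Optional[int] = None,
                     product_type: Optional[str] = None) -> str:
    """Build the user-scoped listing URL, omitting parameters that were not set."""
    candidates = (("limit", limit), ("skip", skip), ("product_type", product_type))
    params = {key: value for key, value in candidates if value is not None}
    path = f"{base.rstrip('/')}/documents/me"
    return f"{path}?{urlencode(params)}" if params else path
```

For instance, `documents_me_url("/api/v1", limit=20, skip=40)` produces `/api/v1/documents/me?limit=20&skip=40`, and calling it with no filters returns the bare path.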

Document type column: As of v6.3, the Type column in the documents list displays document_type (the AI-classified document category such as factsheet, holdingstatement, portfoliosnapshot) instead of product_type. Each type renders with a distinct colour badge. If document_type is absent in the API response, the column shows -.