PaperManager

AI Pipelines

PaperManager uses multiple AI models for different tasks. This page documents each pipeline, the models used, and how they fit together.


Models Used

| Model | Provider | Used for |
|---|---|---|
| claude-opus-4-6 | Anthropic | Paper summarisation, single-paper chat, knowledge chat |
| claude-haiku-4-5-20251001 | Anthropic | Abstract extraction, reference extraction, topic suggestion, conversation compaction |
| llama3.2:3b | Ollama (local) | Metadata extraction (layer 2), tag suggestion, arXiv query generation, figure captions, affiliation extraction, Cypher assist |
| Claude Vision | Anthropic | Figure chat, figure captioning (claude-vision mode) |

All Anthropic calls can be routed through an enterprise Foundry gateway by setting ANTHROPIC_WORK_API_KEY and ANTHROPIC_WORK_BASE_URL.
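
A minimal sketch of that routing, assuming the backend uses the official anthropic Python SDK (the helper name is illustrative):

```python
import os
from anthropic import Anthropic

def make_anthropic_client() -> Anthropic:
    """Use the Foundry gateway when work credentials are set, else the public API."""
    work_key = os.getenv("ANTHROPIC_WORK_API_KEY")
    work_base = os.getenv("ANTHROPIC_WORK_BASE_URL")
    if work_key and work_base:
        return Anthropic(api_key=work_key, base_url=work_base)
    return Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
```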


Metadata Extraction Pipeline (PDF Upload)

Runs when a PDF is uploaded. Metadata is resolved through four layered strategies; which layer fires depends on what the extracted text contains, and the pipeline stops at the first one that succeeds:

flowchart TD
    Start["PDF bytes"] --> Extract["Docling extracts raw_text"]
    Extract --> FindDOI{DOI / arXiv ID\nfound in text?}

    FindDOI -->|"Yes"| S2["Layer 1a: Semantic Scholar API"]
    S2 -->|fail| CR["Layer 1a fallback: CrossRef API"]

    FindDOI -->|"No DOI but\ntitle found"| S2Title["Layer 1b: S2 title search"]

    FindDOI -->|"Nothing found"| Ollama["Layer 2: Ollama llama3.2:3b\non first 3 000 chars"]
    Ollama -->|"Ollama unavailable"| Heuristic["Layer 3: Regex heuristics\n(first line = title, year regex)"]

    S2 --> AbstractCheck{Abstract\nextracted?}
    CR --> AbstractCheck
    S2Title --> AbstractCheck
    Ollama --> AbstractCheck
    Heuristic --> AbstractCheck

    AbstractCheck -->|"No"| AbstractFallback["ABSTRACT_RE regex →\nClaude Haiku if regex fails"]
    AbstractCheck -->|"Yes"| Done["✅ Metadata ready"]
    AbstractFallback --> Done

Layer Details

| Layer | Trigger | Service | Output |
|---|---|---|---|
| 1a (primary) | DOI or arXiv ID in text | services/metadata_lookup.py → Semantic Scholar | title, year, authors, abstract, topics, citation count, venue |
| 1a (fallback) | S2 fails | CrossRef API | title, year, authors, doi, venue |
| 1b | Title found, no DOI | S2 title search | same as 1a |
| 2 | No DOI, no useful title | Ollama llama3.2:3b on raw_text[:3000] | title, year, authors (structured JSON) |
| 3 | Ollama unavailable | Regex on raw_text | title (first non-empty line), year (4-digit year regex) |
| Abstract fallback | Abstract still missing | ABSTRACT_RE regex → Claude Haiku | abstract text |

The metadata_source property on the Paper node records which layer was used.
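
The branch at the top of the diagram hinges on spotting a DOI or arXiv ID in the extracted text. A self-contained sketch of that detection step (the patterns are illustrative, not the exact ones used by the pipeline):

```python
import re

# Illustrative patterns; the real ones live alongside services/metadata_lookup.py.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:\w]+", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)

def find_identifier(raw_text: str) -> tuple[str, str] | None:
    """Return ("doi", value) or ("arxiv", value) from the first page or two of text."""
    head = raw_text[:5000]
    if m := DOI_RE.search(head):
        return "doi", m.group(0).rstrip(".,;")
    if m := ARXIV_RE.search(head):
        return "arxiv", m.group(1)
    return None

print(find_identifier("Preprint. arXiv:2103.00020v1 [cs.CV] 26 Feb 2021"))
# -> ('arxiv', '2103.00020')
```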


Paper Summarisation

Triggered after PDF upload or via POST /backfill/summary.

flowchart LR
    A["abstract + raw_text"] --> P["Load prompts/summary.txt"]
    P --> C["Claude Opus 4.6\n(claude-opus-4-6)"]
    C --> S["summary string\nsaved to Paper.summary"]

The prompt template at prompts/summary.txt defines the structure of the generated summary.
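
A sketch of the call itself, assuming the anthropic Python SDK; whether the template is passed as the system prompt and how the text is truncated are assumptions, and the exact wiring is in ai.py:

```python
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()  # or the gateway-aware client shown earlier

def summarise(abstract: str, raw_text: str) -> str:
    template = Path("prompts/summary.txt").read_text()
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system=template,
        messages=[{
            "role": "user",
            "content": f"Abstract:\n{abstract}\n\nFull text:\n{raw_text[:100_000]}",
        }],
    )
    return message.content[0].text  # caller saves this to Paper.summary
```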


Topic Suggestion

Triggered during upload or via POST /papers/{id}/topics/suggest or bulk backfill.

flowchart LR
    A["title + abstract"] --> P["Load prompts/topics.txt"]
    P --> C["Claude Haiku\n(claude-haiku-4-5-20251001)"]
    C --> T["3–6 title-case topic strings\ne.g. 'Protein Structure Prediction'"]
    T --> DB["MERGE Topic nodes\n+ ABOUT relationships"]
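
The final MERGE step might look roughly like this with the Neo4j Python driver (node labels match the diagram; property names and relationship direction are assumptions):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_topics(paper_id: str, topics: list[str]) -> None:
    """MERGE each suggested topic and attach it to the paper via ABOUT."""
    query = """
    MATCH (p:Paper {id: $paper_id})
    UNWIND $topics AS name
    MERGE (t:Topic {name: name})
    MERGE (p)-[:ABOUT]->(t)
    """
    with driver.session() as session:
        session.run(query, paper_id=paper_id, topics=topics)
```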

Tag Suggestion

Triggered in the upload modal (optional step) or via POST /tags/suggest.

flowchart LR
    A["title + abstract"] --> O["Ollama llama3.2:3b"]
    O --> Tags["List of tag names\nfrom seeded tag vocabulary"]

Ollama is constrained to suggest only tags from the existing tag vocabulary.
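
One way the constraint can be enforced is prompting with the vocabulary and then filtering the response against it; a sketch using Ollama's HTTP API (prompt wording and host are assumptions):

```python
import json
import requests

def suggest_tags(title: str, abstract: str, vocabulary: list[str]) -> list[str]:
    prompt = (
        f"Pick the most relevant tags for this paper from this list: {', '.join(vocabulary)}.\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        "Reply with a JSON array of tag names only."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    try:
        suggested = json.loads(resp.json()["response"])
    except json.JSONDecodeError:
        return []
    # Hard filter: small local models drift, so drop anything outside the seeded vocabulary.
    allowed = {t.lower(): t for t in vocabulary}
    return [allowed[s.lower()] for s in suggested if isinstance(s, str) and s.lower() in allowed]
```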


Reference Extraction Pipeline

Triggered when the user clicks “Extract References” on the Paper Detail page, or via GET /papers/{id}/extract-references.

flowchart TD
    Start["Paper with raw_text"] --> A{DOI available?}
    A -->|"Yes"| S2["Strategy A:\nSemantic Scholar /references API"]
    A -->|"No"| B["Strategy B:\nRegex on REFERENCES section\nof raw_text"]
    S2 -->|"< 3 results"| B
    B -->|"< 3 results"| Claude["Strategy C:\nClaude Haiku on last 30%\nof raw_text"]
    S2 -->|"≥ 3 results"| Done["✅ Reference list"]
    B -->|"≥ 3 results"| Done
    Claude --> Done

Each extracted reference creates a Paper stub node (title + DOI) tagged from-references and linked via CITES. Stubs are enriched if the full paper is later imported.
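
Strategy B can be sketched as a regex pass over the tail of raw_text (the heading and entry patterns here are illustrative simplifications):

```python
import re

REFS_HEADER_RE = re.compile(r"\n\s*(references|bibliography)\s*\n", re.IGNORECASE)
# Numbered entries such as "[12] Author, Title ..." or "12. Author, Title ..."
REF_ENTRY_RE = re.compile(r"^\s*(?:\[\d+\]|\d+\.)\s+(.+)$", re.MULTILINE)

def extract_references_regex(raw_text: str) -> list[str]:
    """Strategy B: slice the text after the REFERENCES heading and split it into entries."""
    m = REFS_HEADER_RE.search(raw_text)
    if not m:
        return []
    return [entry.strip() for entry in REF_ENTRY_RE.findall(raw_text[m.end():])]
```

If this returns fewer than three entries, the pipeline falls through to Strategy C.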


Single-Paper Chat

Triggered via POST /papers/{id}/chat.

flowchart LR
    Q["User question"] --> P["Load prompts/chat_system.txt"]
    P --> C["Selected model:\nClaude Opus / Claude Work / Ollama"]
    C --> R["Context: raw_text\n(truncated to model limit)"]
    R --> Response["Streaming response\nreturned to browser"]
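
For the Claude models, the streaming leg might look like this with the anthropic SDK (the truncation limit and prompt assembly are assumptions; SSE framing is handled by the endpoint):

```python
from anthropic import Anthropic

client = Anthropic()

def stream_chat(question: str, raw_text: str, system_prompt: str):
    """Yield text chunks as they arrive so the endpoint can relay them to the browser."""
    with client.messages.stream(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Paper text:\n{raw_text[:150_000]}\n\nQuestion: {question}",
        }],
    ) as stream:
        for chunk in stream.text_stream:
            yield chunk
```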

Figure Extraction & Captioning

Triggered via POST /papers/{id}/figures/extract.

flowchart TD
    PDF["PDF from Google Drive"] --> Docling["Docling: extract\npage images + figure regions"]
    Docling --> Method{Caption method\nfrom settings}
    Method -->|"docling"| DocCaption["Docling structural caption"]
    Method -->|"ollama"| OllamaCaption["Ollama llama3.2:3b\nfrom prompts/figure_captions.txt"]
    Method -->|"claude-vision"| ClaudeCaption["Claude Vision\n(claude-haiku)"]
    DocCaption --> Upload["Upload PNG to Google Drive"]
    OllamaCaption --> Upload
    ClaudeCaption --> Upload
    Upload --> DB["Save Figure node in Neo4j"]
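
The caption-method branch is essentially a dispatch on the settings value; a sketch in which ollama_caption and claude_vision_caption are hypothetical helpers standing in for the two model-backed paths:

```python
def caption_figure(image_png: bytes, structural_caption: str, method: str) -> str:
    """Choose the captioning backend according to the settings value in the diagram."""
    if method == "docling":
        # Keep whatever caption Docling recovered from the document layout.
        return structural_caption
    if method == "ollama":
        return ollama_caption(image_png)         # hypothetical: prompts/figure_captions.txt + llama3.2:3b
    if method == "claude-vision":
        return claude_vision_caption(image_png)  # hypothetical: see the vision sketch below
    raise ValueError(f"Unknown caption method: {method}")
```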

Figure Vision Chat

Triggered via POST /papers/{id}/figures/{fig_id}/chat.

The figure image is retrieved from Google Drive and sent to Claude with the question:

System: You are analysing a scientific figure.
User: [image bytes] + question text
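
A sketch of that request using the anthropic SDK's image content blocks (the model choice and max_tokens are assumptions):

```python
import base64
from anthropic import Anthropic

client = Anthropic()

def ask_about_figure(image_png: bytes, question: str) -> str:
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system="You are analysing a scientific figure.",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(image_png).decode(),
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return message.content[0].text
```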

Knowledge Chat Context Assembly

Triggered via POST /knowledge-chat/stream.

flowchart TD
    Q["User question"] --> Parse["Parse @mentions\n@tag:, @topic:, @project:, @paper:"]
    Parse -->|"Mentions found"| Cypher["Run Cypher queries\nto fetch matching papers"]
    Parse -->|"No mentions"| Recent["Fetch 10 most recently\nadded papers"]
    Cypher --> Budget["Apply token budget\nper paper (truncate raw_text)"]
    Recent --> Budget
    Budget --> System["Load prompts/knowledge_chat_system.txt"]
    System --> Claude["Claude Opus 4.6\n(streaming SSE)"]
    Claude --> Browser["Token-by-token response\nto browser"]
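
The @mention parsing step can be sketched with a single regex (the exact grammar in knowledge_chat.py may differ; the mention values below are made up):

```python
import re

MENTION_RE = re.compile(r'@(tag|topic|project|paper):("[^"]+"|[\w-]+)')

def parse_mentions(question: str) -> list[tuple[str, str]]:
    """Return (kind, value) pairs to drive the Cypher lookups."""
    return [(kind, value.strip('"')) for kind, value in MENTION_RE.findall(question)]

print(parse_mentions('Compare @topic:protein-folding with @paper:"AlphaFold 2"'))
# -> [('topic', 'protein-folding'), ('paper', 'AlphaFold 2')]
```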

Affiliation Extraction

Triggered as part of the paper upload when author affiliations are missing.

flowchart LR
    T["raw_text first 2 000 chars"] --> P["Load prompts/author_affiliations.txt"]
    P --> O["Ollama llama3.2:3b"]
    O --> A["author → affiliation mapping"]
    A --> DB["Update Person.affiliation in Neo4j"]
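
Writing the mapping back can be a single Cypher statement, reusing the Neo4j driver from the topic-suggestion sketch (property and label names are assumptions):

```python
def save_affiliations(mapping: dict[str, str]) -> None:
    """Set Person.affiliation for each author name returned by the model."""
    query = """
    UNWIND keys($mapping) AS name
    MATCH (p:Person {name: name})
    SET p.affiliation = $mapping[name]
    """
    with driver.session() as session:
        session.run(query, mapping=mapping)
```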

Prompt Templates

All prompts live in prompts/ and are loaded fresh on each call — edit without restarting:

| File | Used in | Purpose |
|---|---|---|
| summary.txt | ai.py | Paper summarisation |
| topics.txt | ai.py | Topic suggestion |
| chat_system.txt | ai.py | Single-paper Q&A system prompt |
| knowledge_chat_system.txt | knowledge_chat.py | Multi-paper synthesis system prompt |
| figure_captions.txt | figure_extractor.py | Figure caption generation |
| author_affiliations.txt | pdf_parser.py | Author affiliation extraction |
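
Because the templates are re-read on every call (as noted above the table), the loader can be as small as this sketch:

```python
from pathlib import Path

PROMPTS_DIR = Path(__file__).parent / "prompts"

def load_prompt(name: str) -> str:
    """Read the template from disk each time so edits take effect without a restart."""
    return (PROMPTS_DIR / name).read_text(encoding="utf-8")

system_prompt = load_prompt("chat_system.txt")
```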