The backend is a Python 3.11 FastAPI application. It also runs a separate MCP server process that shares the same business logic.
backend/
├── main.py # FastAPI app entry — mounts all routers
├── config.py # Reads env vars (pydantic Settings)
├── logger.py # Logging setup
├── mcp_server.py # MCP server entry point (separate process)
│
├── db/
│ ├── connection.py # Neo4j driver singleton
│ ├── schema.py # Schema setup (indexes, constraints)
│ └── queries/
│     ├── papers.py # All Cypher for Paper nodes
│     ├── people.py # All Cypher for Person nodes
│     ├── topics.py # All Cypher for Topic nodes
│     ├── tags.py # All Cypher for Tag nodes
│     ├── notes.py # All Cypher for Note nodes + MENTIONS
│     └── projects.py # All Cypher for Project nodes
│
├── routers/
│ ├── papers.py # POST /papers, GET /papers, etc.
│ ├── people.py # CRUD for Person nodes
│ ├── topics.py # CRUD for Topic nodes
│ ├── tags.py # CRUD for Tag nodes + tag seeding
│ ├── projects.py # CRUD for Project nodes
│ ├── search.py # GET /search
│ ├── graph.py # GET /graph (graph visualisation data)
│ ├── stats.py # GET /stats
│ ├── cypher.py # Cypher editor endpoints
│ ├── export.py # BibTeX export
│ ├── backfill.py # Bulk enrichment
│ ├── knowledge_chat.py # Multi-paper chat (SSE)
│ ├── figures.py # Figure extraction + image serving
│ └── bulk_import.py # Bulk import (SSE stream)
│
├── services/
│ ├── ai.py # Claude: summarise, chat, topics, figures
│ ├── drive.py # Upload PDF/images to Drive, get download URL
│ ├── pdf_parser.py # Extract raw text; orchestrate metadata extraction
│ ├── metadata_lookup.py # Semantic Scholar + CrossRef API clients
│ ├── metadata_from_url.py # URL/DOI/arXiv/PubMed resolver
│ ├── figure_extractor.py # Docling / Ollama / Claude Vision figure extraction
│ ├── note_parser.py # Parse @Name and #Topic from markdown text
│ ├── references.py # Reference extraction pipeline
│ └── bulk_resolver.py # Per-entry resolver for bulk import
│
├── models/
│ └── schemas.py # Pydantic request/response models
│
├── tools/ # MCP tool definitions
│ ├── paper_tools.py
│ ├── note_tools.py
│ ├── tag_tools.py
│ ├── person_tools.py
│ ├── project_tools.py
│ └── ai_tools.py
│
├── tests/
│ ├── test_papers.py
│ ├── test_notes.py
│ ├── test_note_parser.py
│ ├── test_drive.py
│ ├── test_ai.py
│ └── test_mcp_tools.py
│
├── prompts/ # Prompt templates (loaded fresh each call)
│ ├── summary.txt
│ ├── topics.txt
│ ├── chat_system.txt
│ ├── knowledge_chat_system.txt
│ ├── figure_captions.txt
│ └── author_affiliations.txt
│
└── requirements.txt
main.py creates the FastAPI application, sets up CORS, registers all routers, and defines a startup lifespan:
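from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    get_driver().verify_connectivity()  # verify Neo4j
    run_schema_setup(get_driver())      # create indexes + constraints
    seed_default_tags(get_driver())     # seed 157 default tags
    yield                               # the app serves requests after this point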
The app is started by start.sh via uvicorn backend.main:app.
config.py uses Pydantic Settings to read environment variables with type validation:
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    google_client_id: str
    google_client_secret: str
    google_drive_folder_id: str
    anthropic_api_key: str
    # ... etc.
    model_config = SettingsConfigDict(env_file=".env")

settings = Settings()  # fails at import time if a required variable is missing
settings is a module-level singleton imported throughout the app.
db/connection.py manages a Neo4j driver singleton:
def get_driver() -> Driver:
    # returns module-level cached driver instance
    ...
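One common way to get a module-level cached instance is to wrap the factory in `functools.lru_cache`. The sketch below is an assumption about the general shape of such a module, not the actual contents of db/connection.py. `FakeDriver`, `close_driver`, and the bolt URI are hypothetical; `FakeDriver` stands in for the object that `neo4j.GraphDatabase.driver(...)` would return, so the example runs without a database.

```python
from functools import lru_cache


class FakeDriver:
    """Hypothetical stand-in for neo4j.Driver so the sketch runs offline."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self.closed = False

    def close(self) -> None:
        self.closed = True


@lru_cache(maxsize=1)
def get_driver() -> FakeDriver:
    # A real module would build the Neo4j driver from settings here,
    # e.g. GraphDatabase.driver(uri, auth=(user, password)).
    return FakeDriver("bolt://localhost:7687")


def close_driver() -> None:
    # Close the cached driver (if one exists) and forget it,
    # so the next get_driver() call builds a fresh instance.
    if get_driver.cache_info().currsize:
        get_driver().close()
        get_driver.cache_clear()
```

Every caller that invokes `get_driver()` receives the same object until `close_driver()` is called, which is what lets routers and MCP tools share one connection pool.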
db/schema.py runs on startup to create Neo4j indexes and uniqueness constraints. It is idempotent, so it is safe to run multiple times.
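In Neo4j, idempotent schema setup is usually achieved with `IF NOT EXISTS` on each statement. The following is a minimal sketch under stated assumptions: the constraint and index names are hypothetical, `p.title` is an assumed property (only `Paper.doi` appears in this document), and the `REQUIRE` syntax assumes Neo4j 4.4 or later. The return value and the `RecordingDriver` / `RecordingSession` classes are illustrative stand-ins so the block runs without a database.

```python
# Hypothetical constraint and index names; only Paper.doi is taken
# from the query example in this document.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT paper_doi_unique IF NOT EXISTS "
    "FOR (p:Paper) REQUIRE p.doi IS UNIQUE",
    "CREATE INDEX paper_title_index IF NOT EXISTS "
    "FOR (p:Paper) ON (p.title)",
]


def run_schema_setup(driver) -> int:
    """Send every schema statement; return how many were sent."""
    with driver.session() as session:
        for statement in SCHEMA_STATEMENTS:
            session.run(statement)
    return len(SCHEMA_STATEMENTS)


class RecordingSession:
    """Stand-in for a Neo4j session that only records the Cypher it receives."""

    def __init__(self, log: list) -> None:
        self.log = log

    def run(self, query: str, **params) -> None:
        self.log.append(query)

    def __enter__(self):
        return self

    def __exit__(self, *exc) -> bool:
        return False


class RecordingDriver:
    """Stand-in for a Neo4j driver; hands out recording sessions."""

    def __init__(self) -> None:
        self.log = []

    def session(self) -> RecordingSession:
        return RecordingSession(self.log)
```

Because each statement carries `IF NOT EXISTS`, a real database treats a second run as a no-op instead of raising an "already exists" error.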
Each file in db/queries/ contains functions that take the driver, execute Cypher via driver.session().run(), and return plain data. No FastAPI or MCP types leak into this layer.
Example pattern:
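def create_paper(driver: Driver, paper_data: dict) -> dict:
    with driver.session() as session:
        result = session.run(
            """
            MERGE (p:Paper {doi: $doi})
            SET p += $props
            RETURN p
            """,
            doi=paper_data["doi"],
            props=paper_data,
        )
        # Read the record while the session is still open, and convert
        # the returned Node to a plain dict to match the return type.
        return dict(result.single()["p"])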
Each router file creates an APIRouter with a prefix and tags, calls into db/queries/ and services/, and uses the models from models/schemas.py.
Example:
router = APIRouter(prefix="/papers", tags=["papers"])

@router.post("/upload", response_model=PaperOut)
async def upload_paper(file: UploadFile, ...):
    pdf_bytes = await file.read()
    raw_text = pdf_parser.extract_text(pdf_bytes)
    metadata = await pdf_parser.extract_metadata(raw_text)
    drive_id = drive.upload_pdf(pdf_bytes)
    summary = await ai.summarize_paper(metadata["abstract"])
    paper = db_papers.create_paper(driver, {...})
    return paper
| File | Responsibility |
|---|---|
| ai.py | All Claude API calls: summarise, chat, topic suggestion, figure captions, reference extraction |
| drive.py | Upload files to Google Drive; generate download URLs; handle OAuth flow |
| pdf_parser.py | Extract raw text with Docling; orchestrate the 4-layer metadata extraction pipeline |
| metadata_lookup.py | HTTP clients for Semantic Scholar and CrossRef |
| metadata_from_url.py | Parse and resolve URLs (arXiv, DOI, PubMed, bioRxiv, medRxiv) |
| figure_extractor.py | Extract figures from PDF pages; generate captions via Docling/Ollama/Claude |
| note_parser.py | Regex-based @Name and #Topic extraction from Markdown text |
| references.py | Three-strategy reference extraction (S2 API → regex → Claude Haiku) |
| bulk_resolver.py | Per-entry resolution logic for the bulk import endpoint |
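To illustrate the regex-based extraction that note_parser.py is described as doing, here is a minimal sketch. The exact token syntax is an assumption (a letter followed by letters, digits, underscores, or hyphens), and `parse_note`, `MENTION_RE`, and `TOPIC_RE` are hypothetical names; the real parser's rules may differ.

```python
import re

# Assumed syntax: "@" or "#" directly followed by a letter, then any mix of
# letters, digits, underscores, or hyphens. The lookbehind skips e-mail
# addresses, and requiring a letter right after "#" skips Markdown headings.
MENTION_RE = re.compile(r"(?<!\w)@([A-Za-z][\w-]*)")
TOPIC_RE = re.compile(r"(?<!\w)#([A-Za-z][\w-]*)")


def parse_note(text: str) -> dict[str, list[str]]:
    """Return unique @Name and #Topic tokens in order of first appearance."""

    def unique(matches: list[str]) -> list[str]:
        # dict preserves insertion order, so this de-duplicates stably.
        return list(dict.fromkeys(matches))

    return {
        "people": unique(MENTION_RE.findall(text)),
        "topics": unique(TOPIC_RE.findall(text)),
    }


note = (
    "# Meeting notes\n"
    "Discussed #GraphNeuralNets with @Alice and @Bob. @Alice will follow up.\n"
    "Contact: bob@example.com\n"
)
print(parse_note(note))
# → {'people': ['Alice', 'Bob'], 'topics': ['GraphNeuralNets']}
```

The extracted names would then be matched against Person and Topic nodes to create the MENTIONS relationships handled in db/queries/notes.py.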
models/schemas.py defines all request and response models. Routers use them, for example as the response_model of an endpoint.
Key models include PaperOut, PersonOut, NoteOut, ProjectOut, TagOut, TopicOut, HealthResponse, and various *Create / *Update input models.
Each file in tools/ registers MCP tools using FastMCP:
from fastmcp import FastMCP

mcp = FastMCP("PaperManager")

@mcp.tool()
async def search_papers(query: str, tag: str | None = None, ...) -> list[dict]:
    """Search papers by keyword, tag, topic, project, or person."""
    return db_papers.search(get_driver(), query, tag=tag, ...)
Tools are thin wrappers — validation happens in FastMCP, logic lives in db/queries/.
All AI prompt templates are plain text files loaded fresh on each API call:
from pathlib import Path

def load_prompt(name: str) -> str:
    # Resolve prompts/ relative to this file's parent package.
    path = Path(__file__).parent.parent / "prompts" / name
    return path.read_text()
Because templates are read from disk on every call, you can edit a prompt file without restarting the backend.
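The effect of loading on every call can be seen in a small self-contained sketch. It uses a temporary directory in place of backend/prompts/, passes the directory explicitly (the `prompts_dir` parameter is an addition for the demo), and assumes a hypothetical `{abstract}` placeholder; the real templates may use a different substitution scheme.

```python
import tempfile
from pathlib import Path


def load_prompt(name: str, prompts_dir: Path) -> str:
    # Read from disk on every call; nothing is cached in memory.
    return (prompts_dir / name).read_text(encoding="utf-8")


def render_summary_prompt(prompts_dir: Path, abstract: str) -> str:
    # Hypothetical "{abstract}" placeholder, used only for this demo.
    template = load_prompt("summary.txt", prompts_dir)
    return template.replace("{abstract}", abstract)


with tempfile.TemporaryDirectory() as tmp:
    prompts = Path(tmp)
    (prompts / "summary.txt").write_text("Summarise: {abstract}", encoding="utf-8")
    first = render_summary_prompt(prompts, "Graphs.")

    # Simulate editing the template while the process keeps running.
    (prompts / "summary.txt").write_text("TL;DR: {abstract}", encoding="utf-8")
    second = render_summary_prompt(prompts, "Graphs.")

print(first)   # Summarise: Graphs.
print(second)  # TL;DR: Graphs.
```

The trade-off is one small file read per AI call, which is negligible next to the latency of the API request itself.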
Tests live in backend/tests/ and use pytest. Run with:
cd backend
pytest
Tests mock external services (Neo4j, Drive, Claude) to run without real credentials.
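As an illustration of mocking Neo4j, the sketch below tests a query function against a `unittest.mock.MagicMock` driver. It is not taken from the actual test suite: `create_paper` is repeated in the same shape as the query-layer example so the block is self-contained, `make_fake_driver` and the test name are hypothetical, and the DOI is a made-up value.

```python
from unittest.mock import MagicMock


def create_paper(driver, paper_data: dict) -> dict:
    # Same shape as the query-layer example in this document.
    with driver.session() as session:
        result = session.run(
            "MERGE (p:Paper {doi: $doi}) SET p += $props RETURN p",
            doi=paper_data["doi"],
            props=paper_data,
        )
        return dict(result.single()["p"])


def make_fake_driver(node: dict) -> MagicMock:
    """Build a mock driver whose session returns `node` for any query."""
    driver = MagicMock()
    # driver.session() is used as a context manager, so configure the
    # object that __enter__ hands back to the `with` block.
    session = driver.session.return_value.__enter__.return_value
    session.run.return_value.single.return_value = {"p": node}
    return driver


def test_create_paper_uses_doi_as_merge_key():
    paper = {"doi": "10.1000/example", "title": "An Example"}
    driver = make_fake_driver(paper)

    assert create_paper(driver, paper) == paper

    # Inspect the keyword arguments that were passed to session.run().
    session = driver.session.return_value.__enter__.return_value
    _, kwargs = session.run.call_args
    assert kwargs["doi"] == "10.1000/example"
    assert kwargs["props"] == paper
```

Because the query layer only depends on the driver object it is given, no patching of imports is needed; the test simply passes the mock in.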