A retrieval-augmented generation (RAG) pipeline that synthesizes peer-reviewed scientific literature using open-source LLMs modified by directional ablation, deployed on private infrastructure.

Architecture

01

Corpus

11 async data sources, deduplication by DOI, paragraph chunking, MiniLM embeddings into ChromaDB.
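As a rough illustration of the deduplication and chunking steps, here is a minimal sketch. The function names, the record shape (a dict with a "doi" key), and the blank-line paragraph rule are assumptions made for this example, not the project's actual code; the embedding and ChromaDB steps are left out.

```python
import re


def normalize_doi(doi):
    """Lowercase a DOI and strip URL or 'doi:' prefixes so variants compare equal."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def dedupe_by_doi(records):
    """Keep the first record seen for each DOI (hypothetical record shape: dict with 'doi')."""
    seen = set()
    unique = []
    for rec in records:
        doi = rec.get("doi")
        if not doi:
            # Records without a DOI cannot be matched this way, so they are kept as-is.
            unique.append(rec)
            continue
        key = normalize_doi(doi)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def chunk_paragraphs(text):
    """Split text into paragraphs on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Keeping DOI-less records rather than dropping them is a design choice in this sketch; a real pipeline might fall back to matching on title instead.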

02

Retrieval

Dense vector search + BM25 sparse retrieval, reciprocal rank fusion, cross-encoder reranking.
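Reciprocal rank fusion merges the dense and BM25 result lists using ranks only, so the two retrievers' scores never need to be on the same scale. A minimal sketch follows; the function name is hypothetical, and k=60 is the commonly used default for RRF, assumed here rather than taken from the project. Cross-encoder reranking would run on the fused list afterwards and is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # illustrative ids from vector search
sparse_hits = ["b", "d", "a"]  # illustrative ids from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both lists ("a" and "b" here) rise above those found by only one retriever.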

03

Synthesis

RunPod serverless inference with ablated open-weights LLMs. RAG-grounded output with inline citations.
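One way to get RAG-grounded output with inline citations is to number the retrieved chunks in the prompt and ask the model to cite them as [n]. The sketch below only builds such a prompt; the helper name, the chunk fields ("doi", "text"), and the prompt wording are assumptions for illustration, and the call to the RunPod endpoint is deliberately omitted because its payload depends on the deployed worker.

```python
def build_grounded_prompt(question, chunks):
    """Hypothetical helper: number each retrieved chunk and request [n] citations."""
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        # Assumed chunk shape: {"doi": str, "text": str}
        sources.append(f"[{i}] (doi: {chunk['doi']}) {chunk['text']}")
    context = "\n".join(sources)
    return (
        "Answer using only the numbered sources below. "
        "Cite every claim inline as [n]. "
        "If the sources do not support an answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Numbering the sources in the prompt gives the validation stage something concrete to check each citation against.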

04

Validation

Citation verification, hallucination detection, uncertainty quantification, human review gate.
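A simple form of citation verification checks that every [n] marker in the answer points at a source that was actually retrieved, and flags sentences that carry no citation at all. The sketch below assumes numeric [n] markers placed before the sentence-ending punctuation; the function name and return shape are hypothetical, and it does not check whether the cited passage supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def verify_citations(answer, num_sources):
    """Return citation numbers outside 1..num_sources and sentences with no citation."""
    cited = {int(n) for n in CITATION.findall(answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    return {"invalid": invalid, "uncited": uncited}


report = verify_citations(
    "Symptoms decreased [1]. The effect persisted [3]. More work is needed.",
    num_sources=2,
)
```

Anything in either list could be routed to the human review gate rather than rejected automatically.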

Data Sources

Literature

Semantic Scholar: Papers, abstracts, citation graphs
PubMed: Biomedical papers, MeSH terms
arXiv: Preprints, PDF links
CrossRef: DOI resolution, bibliographic metadata
CORE: Full-text open access papers
OpenAlex: Papers, abstracts, citation data
ClinicalTrials.gov: Trial metadata, status, outcomes

Statistics

U.S. Census Bureau: ACS, decennial census, economic data
FRED: GDP, unemployment, inflation series
OECD: International health, education, economics
data.gov: U.S. federal open data catalog
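The sources above are queried asynchronously. A minimal sketch of that fan-out pattern with asyncio is shown here, using a stub in place of real HTTP clients; fetch_stub, fetch_all, and the record shape are hypothetical names for this example, and skipping failed sources is an assumed policy.

```python
import asyncio


async def fetch_stub(source, query):
    """Stand-in for a real per-source client; returns one fake record."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return [{"source": source, "query": query}]


async def fetch_all(sources, query):
    """Query all sources concurrently and merge the results in source order."""
    tasks = [fetch_stub(s, query) for s in sources]
    # return_exceptions=True keeps one failing source from cancelling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    records = []
    for result in results:
        if isinstance(result, Exception):
            continue  # assumed policy: skip sources that failed
        records.extend(result)
    return records


records = asyncio.run(fetch_all(["semantic_scholar", "pubmed"], "psilocybin depression"))
```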

Quickstart

# Clone and install
$ git clone https://github.com/opensynthesislabs/open-synthesis.git
$ cd open-synthesis
$ uv sync

# List available data sources
$ open-synthesis sources

# Ingest papers on a topic
$ open-synthesis ingest "psilocybin depression" --sources semantic_scholar,pubmed

# Run a synthesis (requires RunPod endpoint)
$ open-synthesis synthesize "What is the evidence for psilocybin as a treatment for MDD?"