PreprintNode
arXiv / bioRxiv / medRxiv Preprint Feed API

Real-time preprint feed with NLP-ready abstracts, author affiliations, and DOIs.

Built for: Scientific research agents, RAG pipelines, scouting AIs.

GET /v1/preprints

[
  {
    "id": "preprint-0000",
    "source": "example",
    "source_id": "preprint-0000",
    "type": "preprint",
    "discovered_at": "1970-01-01T00:00:00.000Z",
    "payload": {
      "archive": "bioRxiv",
      "affiliation": "Meta FAIR"
    }
  },
  {
    "id": "preprint-0001",
    "source": "example",
    "source_id": "preprint-0001",
    "type": "preprint",
    "discovered_at": "1969-12-31T22:00:00.000Z",
    "payload": {
      "archive": "bioRxiv",
      "affiliation": "Meta FAIR"
    }
  },
  {
    "id": "preprint-0002",
    "source": "example",
    "source_id": "preprint-0002",
    "type": "preprint",
    "discovered_at": "1969-12-31T18:00:00.000Z",
    "payload": {
      "archive": "arXiv cond-mat",
      "affiliation": "Meta FAIR"
    }
  }
]

Schema fields

title: Paper Title
payload.archive: Archive
payload.affiliation: Lead Affiliation

Realtime

Webhooks

HMAC signed

Starts at

$49/mo

Curator tier · cancel anytime

Drop-in for any stack

Wire PreprintNode into your agent — one snippet, seven frameworks.

One-line install

curl "https://www.aiagentnode.io/api/v1/nodes/preprint/records?limit=10" \
  -H "Authorization: Bearer $AIAGENTNODE_KEY"
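
For agents written in Python, the same request can be sketched with the standard library alone. The URL and the Bearer header come from the curl command above; the function names are illustrative, and the assumption that the response body is a JSON array (as in the sample near the top of this page) is not confirmed here.

```python
import json
import urllib.request

# URL copied from the curl command above.
BASE_URL = "https://www.aiagentnode.io/api/v1/nodes/preprint/records"


def build_request(api_key: str, limit: int = 10) -> urllib.request.Request:
    """Build the same authenticated GET that the curl command sends."""
    return urllib.request.Request(
        f"{BASE_URL}?limit={limit}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_records(api_key: str, limit: int = 10) -> list:
    """Send the request and decode the body.

    Assumption: the body is a JSON array of records, like the sample response.
    """
    with urllib.request.urlopen(build_request(api_key, limit), timeout=10) as resp:
        return json.load(resp)
```

Calling fetch_records(os.environ["AIAGENTNODE_KEY"]) would mirror the curl call, with the key read from the same environment variable.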

What teams build

PreprintNode powers…

Autonomous agents

Pipe PreprintNode into your LangChain / Vercel AI / Lovable agent for real-time decisions.

RAG enrichment

Schema-stable JSON drops straight into your vector store with predictable embeddings and no per-source parsing.

Internal dashboards

Webhook into Slack, Notion, Linear, or your own ops console. Same auth across every node.

How PreprintNode works

Sourced from primary registries. Normalized for agents.

Primary data sources

arXiv OAI-PMH feed
bioRxiv API
medRxiv API
ChemRxiv
SSRN cross-checks

We ingest from the upstream of record, never from secondary scrapers, so every PreprintNode record in your agent traces back to an authoritative publisher.

Methodology

Continuous polling of upstream sources with adaptive backoff
Deduplication via stable source_id + content fingerprint
LLM-friendly normalization into a single JSON envelope
Schema versioning so existing agents never break
HMAC-signed webhooks for verifiable push delivery
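
The deduplication step can be pictured with a small client-side sketch. The page does not say how the content fingerprint is computed, so the SHA-256 hash of the canonically serialized payload used here is an assumption, and fingerprint and dedupe are illustrative names rather than part of the API.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Hash the payload with sorted keys so key order does not change the result.

    Assumption: SHA-256 over canonical JSON stands in for the real fingerprint.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (source_id, fingerprint) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for record in records:
        key = (record["source_id"], fingerprint(record["payload"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Keying on both values means a revised preprint (same source_id, changed payload) is kept as a new record, while an exact re-delivery is dropped.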

Update cadence

Every 15 minutes

End-to-end latency

Under 60 seconds from upstream publication to API

Coverage

All major STEM preprint archives

History

Rolling 36-month archive on Pro+

PreprintNode vs DIY scraping

Stop maintaining brittle scrapers. Ship the agent.

Most teams spend a quarter rebuilding what PreprintNode ships in a single API key. Here's the honest tradeoff.

Dimension	PreprintNode	DIY scraper
Setup time	One API key, one endpoint	Weeks of scraper engineering
Schema stability	Versioned JSON contract	Breaks every upstream redesign
Freshness	<60s from publication	Cron job, often hours stale
Delivery	Polling + HMAC webhooks	Build your own queue
LLM readiness	Token-optimized payload	Manual cleaning per record
Compliance	Source TOS handled upstream	Your legal exposure

Integrate in 3 steps

From signup to first PreprintNode record in under 5 minutes.

1. Generate an API key
Pick a tier, complete checkout, and a Bearer token is minted instantly — no email handoff.
2. Call /v1/preprints
Send an authenticated GET to receive a paginated JSON envelope. Cursor in, more records out.
3. Subscribe to webhooks
Register an HTTPS endpoint to receive HMAC-signed pushes within seconds of upstream publication.
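
Step 2 can be sketched as a generator that follows cursors until none is returned. The page shape used here, a dict with "records" and "next_cursor" keys, is hypothetical, as is the iter_records name; check the actual response for the real field names before relying on it.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page shape: {"records": [...], "next_cursor": "<token>" or None}.
FetchPage = Callable[[Optional[str]], dict]


def iter_records(fetch_page: FetchPage, max_pages: int = 100) -> Iterator[dict]:
    """Yield records page by page, passing each returned cursor to the next call."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        page = fetch_page(cursor)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

Passing the HTTP call in as fetch_page keeps the loop testable without network access.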

Frequently asked

PreprintNode questions, answered.

How fresh is the PreprintNode API data?

New records appear in /v1/preprints within ~60 seconds of being published upstream. The full dataset is polled every 5 minutes and pushed to webhook subscribers in the same window.

What format does PreprintNode return?

Every endpoint returns a stable JSON envelope: { id, source, source_id, type, discovered_at, payload }. The payload mirrors the source's natural shape, normalized and token-optimized for direct ingestion into LLMs, vector stores, and RAG pipelines.
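
A minimal sketch of loading that envelope into a typed object, assuming the six keys listed above are always present. PreprintRecord and parse_record are illustrative names, not part of any official SDK.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PreprintRecord:
    id: str
    source: str
    source_id: str
    type: str
    discovered_at: datetime
    payload: dict


def parse_record(raw: dict) -> PreprintRecord:
    """Convert one decoded JSON envelope into a PreprintRecord."""
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    timestamp = raw["discovered_at"].replace("Z", "+00:00")
    return PreprintRecord(
        id=raw["id"],
        source=raw["source"],
        source_id=raw["source_id"],
        type=raw["type"],
        discovered_at=datetime.fromisoformat(timestamp),
        payload=raw["payload"],
    )
```

The payload stays a plain dict because its keys vary with the source.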

Is PreprintNode suitable for AI agents and RAG pipelines?

Yes — that is the primary design goal. Field keys are stable, units are normalized, free text is cleaned, and total token weight per record is minimized so you can drop responses directly into LangChain, Vercel AI SDK, OpenAI tools, MCP, n8n, or Lovable agents without preprocessing.

How is authentication handled?

A single API key (Bearer token) works across every node. Webhook payloads are HMAC-SHA256 signed with your tenant secret so you can verify provenance before acting on a record.
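
Verifying a webhook before acting on it might look like the sketch below. It assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body; the header that carries the signature and its exact encoding are not stated on this page, so confirm both before use. verify_signature is an illustrative name.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex matches HMAC-SHA256(secret, body).

    Assumption: the sender signs the raw body bytes and hex-encodes the digest.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

Verify against the bytes exactly as received, before any JSON decoding, since re-serializing the body can change whitespace and break the match.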

Can I get historical PreprintNode data for backtesting?

Pro tiers include a rolling 24-month backfill via the same endpoint with ?since=<ISO date>. Scale and Agency tiers can request full historical exports as compressed JSONL.
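
A small sketch of building a backfill URL with the since parameter. Whether the API expects a date-only value such as 2024-01-01 or a full timestamp is an assumption here, and backfill_url is an illustrative helper.

```python
from datetime import date
from urllib.parse import urlencode


def backfill_url(base_url: str, since: date) -> str:
    """Append ?since=<ISO date> to the endpoint URL.

    Assumption: a date-only ISO 8601 value (YYYY-MM-DD) is accepted.
    """
    return f"{base_url}?{urlencode({'since': since.isoformat()})}"
```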

What is the rate limit?

Curator: 60 req/min. Pro: 600 req/min. Scale: 6,000 req/min. Agency: 60,000 req/min. Enterprise: negotiated. All tiers support paginated cursors so you never need to spike for a backfill.
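
To stay under these limits during a backfill, a client can space its requests evenly. The sketch below uses the per-minute figures quoted above; the Throttle class is illustrative, and how the server counts requests (fixed window, sliding window, or bursts) is not specified here, so even spacing is a conservative assumption.

```python
import time

# Per-minute limits copied from the answer above (Enterprise is negotiated).
RATE_LIMITS_PER_MIN = {"Curator": 60, "Pro": 600, "Scale": 6_000, "Agency": 60_000}


def min_interval_seconds(tier: str) -> float:
    """Smallest gap between requests that stays within the tier's limit."""
    return 60.0 / RATE_LIMITS_PER_MIN[tier]


class Throttle:
    """Block just long enough before each request to keep an even spacing."""

    def __init__(self, tier: str, clock=time.monotonic, sleep=time.sleep):
        self._interval = min_interval_seconds(tier)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot one interval after this request's slot.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Call wait() immediately before each request; on the Curator tier this allows about one request per second.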

Why not just scrape PreprintNode sources directly?

Upstream sources change formats, throttle aggressively, and break silently. We absorb that fragility, version the schema, sign deliveries, and republish a single contract so your agent stays online when the source moves.

HMAC-signed webhooks

SHA-256 signatures on every push.

99.95% uptime SLA

Status + history at /status.

Versioned schema

No silent breaking changes, ever.

One auth across nodes

One key, every endpoint.

Related nodes