Paperdl Clients

This document summarizes the built-in Paperdl clients, their search/download arguments, and practical examples. For a short usage-oriented introduction, read QuickStart.md first.

1. Common Model

All built-in clients inherit from BasePaperClient and follow the same basic pattern:

import asyncio
from paperdl.modules import ArxivPaperClient

async def main():
    async with ArxivPaperClient(verbose=False, show_progress=False) as client:
        papers = await client.search("diffusion model", total_results=5)
        paths = await client.download(papers, output_dir="papers")
        print(paths)

asyncio.run(main())

Common download API:

paths = await client.download(
    papers,
    output_dir="paperdl_outputs",
    overwrite=False,
    return_exceptions=False,
)
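
This section does not spell out what return_exceptions does. If it mirrors asyncio.gather (an assumption, not something this document confirms), failures would come back inside the result list in place of paths instead of being raised. The sketch below uses stand-in coroutines rather than Paperdl itself to show how such a mixed list could be split; fake_download and split_results are hypothetical names.

```python
import asyncio

async def fake_download(name: str) -> str:
    """Hypothetical stand-in for one download; fails for names starting with 'bad'."""
    await asyncio.sleep(0)
    if name.startswith("bad"):
        raise RuntimeError(f"download failed: {name}")
    return f"papers/{name}.pdf"

def split_results(results):
    """Separate successful paths from exceptions in a gather-style result list."""
    paths = [r for r in results if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    return paths, errors

async def demo():
    names = ["a", "bad-b", "c"]
    # return_exceptions=True keeps going after a failure and returns the exception object.
    results = await asyncio.gather(
        *(fake_download(n) for n in names), return_exceptions=True
    )
    return split_results(results)

if __name__ == "__main__":
    paths, errors = asyncio.run(demo())
    print(paths, [str(e) for e in errors])
```

With the default return_exceptions=False, the first failure would presumably propagate to the caller instead.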

Common constructor arguments:

| Parameter | Description |
| --- | --- |
| timeout | Total HTTP request timeout. |
| concurrency | Internal per-client concurrency for requests and downloads. |
| max_retries | Number of retries after request or download failures. |
| retry_backoff | Base wait time for exponential backoff. |
| headers | Custom HTTP headers. |
| cookies / cookie_file | Cookies passed directly, or a cookie file to load and save. |
| proxy | Proxy URL, for example http://127.0.0.1:7890. |
| thread_workers | Thread pool size for blocking tasks. |
| show_progress | Whether to show rich progress bars. |
| progress_mode | Progress mode: auto, summary, detailed, or none. |
| max_detail_tasks | Maximum number of detailed tasks shown in auto mode. |
| verbose | Whether to print logs. |
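
To make the interaction between max_retries and retry_backoff concrete, here is a minimal sketch of exponential backoff. It assumes the delay doubles on each attempt (retry_backoff * 2**attempt); the exact formula Paperdl uses is not stated here, and backoff_delays and call_with_retries are hypothetical helpers, not part of the library.

```python
import asyncio

def backoff_delays(max_retries: int, retry_backoff: float) -> list[float]:
    """Delay before each retry, assuming delay = retry_backoff * 2**attempt."""
    return [retry_backoff * (2 ** attempt) for attempt in range(max_retries)]

async def call_with_retries(func, max_retries: int = 3, retry_backoff: float = 0.5):
    """Call an async function, retrying up to max_retries times after failures."""
    delays = backoff_delays(max_retries, retry_backoff)
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(delays[attempt])
```

Under this assumption, max_retries=3 with retry_backoff=0.5 waits 0.5, 1.0, and 2.0 seconds between the four attempts.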

download(...) calls the client-specific downloaditem(...) internally, so direct calls to downloaditem are rarely needed.

2. ArxivPaperClient

Registered name: arxiv

Import:

from paperdl.modules import ArxivPaperClient

Search arguments

await client.search(
    query,
    total_results=100,
    page_size=50,
    categories=None,
    search_field="all",
    sort_by="submittedDate",
    sort_order="descending",
    raw_query=False,
    deduplicate=True,
)

| Parameter | Description |
| --- | --- |
| query | Search keywords. |
| total_results | Maximum number of returned papers. |
| page_size | Number of results requested per API page. |
| categories | arXiv categories, for example ['cs.CL', 'cs.CV']. |
| search_field | arXiv search field. Default: all. |
| sort_by | relevance, lastUpdatedDate, or submittedDate. |
| sort_order | ascending or descending. |
| raw_query | If True, use the raw arXiv query string without automatic field/category composition. |
| deduplicate | Deduplicate by PaperInfo.identity_key. |
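
The difference between raw_query=False and raw_query=True is easier to see with the query strings written out. The sketch below shows one plausible way to compose an arXiv API search_query from keywords, a search field, and categories, using the public arXiv query syntax (field prefixes, AND/OR, quoted phrases). compose_arxiv_query is a hypothetical helper; the client's actual composition rules may differ.

```python
def compose_arxiv_query(query, categories=None, search_field="all", raw_query=False):
    """Build an arXiv API search_query string (illustrative, not Paperdl's code)."""
    if raw_query:
        return query  # caller supplies the full arXiv query syntax
    # Quote multi-word queries so they are matched as a phrase.
    term = f'"{query}"' if " " in query else query
    parts = [f"{search_field}:{term}"]
    if categories:
        cats = " OR ".join(f"cat:{c}" for c in categories)
        # Parenthesize only when several categories are OR-ed together.
        parts.append(f"({cats})" if len(categories) > 1 else cats)
    return " AND ".join(parts)

if __name__ == "__main__":
    print(compose_arxiv_query("large language model", ["cs.CL", "cs.AI"]))
```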

Examples

Search recent arXiv papers in selected categories:

async with ArxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "large language model",
        total_results=20,
        categories=["cs.CL", "cs.AI"],
        sort_by="submittedDate",
    )

Use a raw arXiv query:

papers = await client.search(
    'cat:cs.CL AND all:"retrieval augmented generation"',
    total_results=10,
    raw_query=True,
    sort_by="relevance",
)

Query by arXiv ID:

paper = await client.getbyid("1706.03762")
papers = await client.searchbyids(["1706.03762", "2307.09288"])

Download results:

paths = await client.download(papers[:3], output_dir="papers/arxiv")

Pass arXiv parameters through PaperClient:

from paperdl import PaperClient

async with PaperClient(
    ["arxiv"],
    client_search_kwargs={"arxiv": {"categories": ["cs.CL"], "sort_by": "relevance"}},
) as client:
    papers = await client.search("transformer", total_results=10)

3. OpenReviewPaperClient

Registered name: openreview

Import:

from paperdl.modules import OpenReviewPaperClient

Additional constructor parameters

| Parameter | Description |
| --- | --- |
| baseurl | OpenReview API URL. Default: https://api2.openreview.net. |
| username / password | Credentials used when login is required. |
| api_version | Default: 2, using openreview.api.OpenReviewClient. |

Search parameters

await client.search(
    query=None,
    total_results=100,
    venue_id=None,
    invitation=None,
    content=None,
    details=None,
    accepted_only=False,
    client_side_filter=True,
)

| Parameter | Description |
| --- | --- |
| query | Optional keyword query. By default, filtering is performed locally on title, abstract, authors, keywords, and related fields. |
| venue_id | Venue ID, for example ICLR.cc/2024/Conference. |
| invitation | OpenReview invitation, for example ICLR.cc/2024/Conference/-/Submission. |
| content | OpenReview content filter, for example {'venueid': 'ICLR.cc/2024/Conference'}. |
| details | details argument passed to OpenReview get_all_notes. |
| accepted_only | When used with venue_id, return only accepted papers. |
| client_side_filter | Whether to filter locally by query. |

At least one of venue_id, invitation, or content must be provided.
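
Local filtering by query can be pictured as a plain text match over the fetched records. The sketch below works on stand-in dictionaries and assumes that every query word must appear somewhere in the title, abstract, authors, or keywords, ignoring case. The matching rule and the helper names are illustrative assumptions, not the client's documented behavior.

```python
def matches_query(record: dict, query: str) -> bool:
    """True if every query word appears in the record's searchable text."""
    fields = ("title", "abstract", "authors", "keywords")
    chunks = []
    for name in fields:
        value = record.get(name) or ""
        # authors and keywords are lists; join them into one string
        chunks.append(" ".join(value) if isinstance(value, (list, tuple)) else str(value))
    haystack = " ".join(chunks).lower()
    return all(word in haystack for word in query.lower().split())

def filter_records(records, query=None):
    """Return all records when query is empty, otherwise only the matching ones."""
    if not query:
        return list(records)
    return [r for r in records if matches_query(r, query)]
```

Because this kind of filter runs after the records are fetched, a narrow venue_id or invitation keeps the amount of data to scan small.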

Examples

Search a conference:

async with OpenReviewPaperClient(verbose=False) as client:
    papers = await client.search(
        "diffusion",
        venue_id="ICLR.cc/2024/Conference",
        total_results=20,
    )

Return accepted papers only:

papers = await client.search(
    venue_id="ICLR.cc/2024/Conference",
    accepted_only=True,
    total_results=100,
)

Use an invitation:

papers = await client.search(
    "reasoning",
    invitation="ICLR.cc/2024/Conference/-/Submission",
    total_results=20,
)

Download results:

paths = await client.download(papers[:5], output_dir="papers/openreview")

If the normal PDF URL fails, the client falls back to the OpenReview attachment API.

4. ACLAnthologyPaperClient

Registered name: acl_anthology

Common aliases: acl

Import:

from paperdl.modules import ACLAnthologyPaperClient

Search parameters

await client.search(
    query=None,
    total_results=100,
    collection_ids=None,
    max_collections=40,
    deduplicate=True,
)

| Parameter | Description |
| --- | --- |
| query | Optional keyword query. If empty, papers in the scanned collections are returned. |
| collection_ids | ACL Anthology XML collection IDs, for example ['2024.acl-long']. |
| max_collections | When collection_ids is not provided, scan this many recent collections. |
| deduplicate | Deduplicate results. |

Examples

Search recent collections:

async with ACLAnthologyPaperClient(verbose=False) as client:
    papers = await client.search(
        "machine translation",
        total_results=20,
        max_collections=30,
    )

Search selected collections:

papers = await client.search(
    "large language model",
    collection_ids=["2024.acl-long", "2024.findings-acl"],
    total_results=20,
)

Download results:

paths = await client.download(papers[:5], output_dir="papers/acl")

5. BioRxivPaperClient and MedRxivPaperClient

Registered names: biorxiv, medrxiv

Import:

from paperdl.modules import BioRxivPaperClient, MedRxivPaperClient

MedRxivPaperClient inherits from BioRxivPaperClient. The main difference is that the source server is medrxiv instead of biorxiv.

Additional constructor parameters

| Parameter | Description |
| --- | --- |
| browser_fallback | Whether to use a Playwright browser fallback when normal download fails. |
| browser_headless | Whether the fallback browser runs in headless mode. |
| browser_channel | Chromium channel, for example chrome. |
| browser_user_data_dir | Browser user data directory. |
| browser_wait_seconds | Wait time before retrying when a challenge page is detected. |

Search parameters

await client.search(
    query=None,
    total_results=100,
    from_date="2024-01-01",
    to_date=None,
    max_scan_results=5000,
    page_size=100,
    deduplicate=True,
)

| Parameter | Description |
| --- | --- |
| query | Optional keyword query. Matching is performed locally on title, abstract, authors, categories, and related fields. |
| from_date / to_date | API scan date range in YYYY-MM-DD format. to_date=None means today. |
| max_scan_results | Maximum number of source records to scan, which prevents very large date ranges from becoming too slow. |
| page_size | Number of records per API request. |
| deduplicate | Deduplicate results. |
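
Since matching happens locally, a search is effectively a paged scan bounded by two limits: total_results (matches wanted) and max_scan_results (records examined). The sketch below illustrates that loop with a caller-supplied fetch_page function standing in for the API, and matches on title only for brevity. resolve_date_range, scan, and fetch_page are hypothetical names, and the real client may page and match differently.

```python
from datetime import date

def resolve_date_range(from_date: str, to_date=None, today=None):
    """Return (start, end) ISO date strings; to_date=None falls back to today."""
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(to_date) if to_date else (today or date.today())
    if end < start:
        raise ValueError("to_date must not be earlier than from_date")
    return start.isoformat(), end.isoformat()

def scan(fetch_page, query, total_results=100, max_scan_results=5000, page_size=100):
    """Page through records, match locally, and stop at either limit."""
    matched, scanned, cursor = [], 0, 0
    needle = query.lower() if query else None
    while scanned < max_scan_results and len(matched) < total_results:
        page = fetch_page(cursor, page_size)
        if not page:
            break  # source exhausted
        for record in page:
            if scanned >= max_scan_results or len(matched) >= total_results:
                break
            scanned += 1
            if needle is None or needle in record["title"].lower():
                matched.append(record)
        cursor += len(page)
    return matched, scanned
```

A rare query over a long date range can hit max_scan_results before total_results is reached, so narrowing from_date is usually more effective than raising the scan limit.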

Examples

Search and download from bioRxiv:

async with BioRxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "single cell",
        from_date="2025-01-01",
        total_results=20,
    )
    paths = await client.download(papers[:3], output_dir="papers/biorxiv")

Search medRxiv:

async with MedRxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "medical imaging",
        from_date="2025-01-01",
        total_results=20,
    )

Disable browser fallback:

async with BioRxivPaperClient(browser_fallback=False) as client:
    papers = await client.search("protein design", total_results=5)

6. PMLRPaperClient

Registered name: pmlr

Import:

from paperdl.modules import PMLRPaperClient

Search parameters

await client.search(
    query=None,
    total_results=100,
    volume_ids=None,
    max_volumes=None,
    enrich_abstracts=True,
    deduplicate=True,
)

| Parameter | Description |
| --- | --- |
| query | Optional keyword query. Matching uses title, abstract, authors, venue, and related fields. |
| volume_ids | Selected PMLR volumes, for example [235, 238]. |
| max_volumes | When volume_ids is not provided, scan this many recent volumes. |
| enrich_abstracts | Visit paper detail pages to enrich abstracts, PDF URLs, authors, year, and other metadata. |
| deduplicate | Deduplicate results. |

Examples

Search recent volumes:

async with PMLRPaperClient(verbose=False) as client:
    papers = await client.search(
        "diffusion",
        max_volumes=30,
        total_results=20,
    )

Search selected volumes:

papers = await client.search(
    "reinforcement learning",
    volume_ids=[235, 238],
    total_results=20,
)

Run a lightweight search without visiting detail pages:

papers = await client.search(
    "optimization",
    max_volumes=20,
    enrich_abstracts=False,
)

Download results:

paths = await client.download(papers[:5], output_dir="papers/pmlr")

7. PMCOAPaperClient

Registered name: pmc_oa

Common aliases: pmc, pubmed, pubmed_central

Import:

from paperdl.modules import PMCOAPaperClient

Additional constructor parameters

| Parameter | Description |
| --- | --- |
| tool | Tool name passed to NCBI E-utilities. |
| email | Email passed to NCBI E-utilities. |
| api_key | NCBI API key. |
| api_delay | Wait time between paginated API requests. |

Search parameters

await client.search(
    query=None,
    total_results=100,
    page_size=50,
    sort="relevance",
    require_pdf=True,
    deduplicate=True,
)

| Parameter | Description |
| --- | --- |
| query | PMC search query. The client automatically applies an open-access filter. |
| total_results | Maximum number of returned papers. |
| page_size | Number of results per API page. Internally capped at 200. |
| sort | NCBI search sort field. Default: relevance. |
| require_pdf | Keep only records with an open-access PDF download link. |
| deduplicate | Deduplicate results. |
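
The page_size cap determines how many paginated requests a search needs, and therefore how often api_delay applies. As a rough sketch, the helper below plans (retstart, retmax) pairs, the offset and page-length parameters of NCBI E-utilities ESearch. plan_pages is a hypothetical helper, and whether Paperdl pages in exactly this way is an assumption.

```python
def plan_pages(total_results: int, page_size: int = 50, max_page_size: int = 200):
    """Return (retstart, retmax) pairs covering total_results records."""
    size = max(1, min(page_size, max_page_size))  # apply the 200-record cap
    pages = []
    start = 0
    while start < total_results:
        # The last page may be shorter than the others.
        pages.append((start, min(size, total_results - start)))
        start += size
    return pages

if __name__ == "__main__":
    print(plan_pages(450, page_size=500))
```

In this sketch a request for 450 results with page_size=500 still takes three calls, because the page size is clamped to 200.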

Examples

Search PMC Open Access:

async with PMCOAPaperClient(email="you@example.com", verbose=False) as client:
    papers = await client.search(
        "cancer immunotherapy",
        total_results=20,
        require_pdf=True,
    )

Use an API key:

async with PMCOAPaperClient(
    email="you@example.com",
    api_key="YOUR_NCBI_API_KEY",
    verbose=False,
) as client:
    papers = await client.search("single cell sequencing", total_results=50)

Download results:

paths = await client.download(papers[:5], output_dir="papers/pmc")

8. Combine Multiple Clients with PaperClient

Use a concrete client when you need fine-grained control over one source. Use PaperClient when you want unified search and download across multiple sources:

import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(
        ["arxiv", "acl_anthology", "pmlr", "pmc_oa"],
        default_init_kwargs={"verbose": False},
        client_init_kwargs={
            "pmc_oa": {"email": "you@example.com"},
        },
        client_search_kwargs={
            "arxiv": {"categories": ["cs.CL"]},
            "acl_anthology": {"max_collections": 20},
            "pmlr": {"max_volumes": 30},
            "pmc_oa": {"require_pdf": True},
        },
        search_concurrency=4,
    ) as client:
        papers = await client.search("large language model", total_results=10)
        paths = await client.download(papers[:10], output_dir="papers")
        print(len(papers), len(paths))

asyncio.run(main())

Return results grouped by source:

results = await client.search(
    "diffusion",
    total_results=10,
    return_by_client=True,
)

for source, papers in results.items():
    print(source, len(papers))
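
Grouped results can be flattened back into one list when needed. The sketch below merges a {source: [papers]} mapping and keeps the first paper seen for each key. It runs on stand-in dictionaries with an "identity_key" entry; with real PaperInfo objects the key function would read the identity_key property instead. merge_grouped is a hypothetical helper, not part of Paperdl.

```python
def merge_grouped(results: dict, key=lambda p: p["identity_key"]):
    """Flatten {source: [papers]} into one list, keeping the first paper per key."""
    seen, merged = set(), []
    for source, papers in results.items():
        for paper in papers:
            k = key(paper)
            if k in seen:
                continue  # same paper already returned by an earlier source
            seen.add(k)
            merged.append(paper)
    return merged

if __name__ == "__main__":
    grouped = {
        "arxiv": [{"identity_key": "doi:1", "title": "A"}],
        "pmlr": [{"identity_key": "doi:1", "title": "A (PMLR)"}],
    }
    print(merge_grouped(grouped))
```

For PaperInfo objects, the call would look like merge_grouped(results, key=lambda p: p.identity_key). Sources listed first win ties, since dictionaries preserve insertion order.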

9. Common PaperInfo Fields

Every search result is a PaperInfo object:

| Field | Description |
| --- | --- |
| source | Source client, for example ArxivPaperClient. |
| title | Paper title. |
| abstract | Abstract text. |
| authors | Author list. |
| article_url | Paper detail page. |
| download_url | PDF download URL. |
| doi / arxiv_id | DOI or arXiv ID. |
| venue / publisher | Conference, journal, or publisher. |
| published_at / updated_at | Publication and update timestamps. |
| source_id | Source-specific internal ID. |
| keywords / categories / tags | Keywords, categories, and tags. |
| extra | Source-specific metadata. |

Useful methods and properties:

paper.year
paper.main_url
paper.short_authors
paper.identity_key
paper.filename()
paper.todict()
paper.tojson()

Restore a paper from a dictionary or JSON string:

from paperdl.modules import PaperInfo

paper = PaperInfo.fromdict(data)
paper = PaperInfo.fromjson(text)
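
One use for this round trip is caching search results on disk. The sketch below writes one JSON object per line and reads the file back. It operates on plain dictionaries and assumes that todict() returns a JSON-serializable dict, which this document does not state explicitly; save_records, load_records, and round_trip are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path

def save_records(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_records(path):
    """Read the records back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def round_trip(records):
    """Save to a temporary file and load again, to check nothing is lost."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "papers.jsonl"
        save_records(records, path)
        return load_records(path)
```

With Paperdl, the records would be [p.todict() for p in papers] on the way out and [PaperInfo.fromdict(d) for d in load_records(path)] on the way back.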