Paperdl Clients

This document summarizes the built-in Paperdl clients, their search/download arguments, and practical examples. For a short usage-oriented introduction, read QuickStart.md first.

1. Common Model

All built-in clients inherit from BasePaperClient and follow the same basic pattern:

import asyncio
from paperdl.modules import ArxivPaperClient

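async def main():
    # Open the client as an async context manager so it is cleaned up on exit.
    async with ArxivPaperClient(verbose=False, show_progress=False) as client:
        # Each search result is a PaperInfo object.
        papers = await client.search("diffusion model", total_results=5)
        paths = await client.download(papers, output_dir="papers")
        print(paths)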

asyncio.run(main())

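The shorter snippets in this document omit this boilerplate. They are meant to run inside an async function, and snippets that start with `papers = ...` or `paths = ...` reuse a `client` opened as shown above.

Common download API: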

paths = await client.download(
    papers,
    output_dir="paperdl_outputs",
    overwrite=False,
    return_exceptions=False,
)
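
The `return_exceptions` flag is not described further in this document. As a rough illustration, the sketch below assumes it behaves like the flag of the same name on `asyncio.gather`: a failed download yields an exception object in place of a path instead of aborting the whole batch. `fake_download`, `download_all`, and `split_results` are hypothetical stand-ins and are not part of Paperdl.

```python
import asyncio


async def fake_download(name):
    """Hypothetical stand-in for one download; fails for names starting with 'bad'."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    if name.startswith("bad"):
        raise RuntimeError(f"download failed: {name}")
    return f"papers/{name}.pdf"


async def download_all(names, return_exceptions=False):
    """Run all downloads concurrently with asyncio.gather semantics."""
    tasks = [fake_download(name) for name in names]
    return await asyncio.gather(*tasks, return_exceptions=return_exceptions)


def split_results(results):
    """Separate successful paths from exception objects."""
    paths = [r for r in results if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    return paths, errors


if __name__ == "__main__":
    results = asyncio.run(download_all(["a", "bad-b", "c"], return_exceptions=True))
    paths, errors = split_results(results)
    print(paths, [str(e) for e in errors])
```

If Paperdl follows this pattern, filtering the returned list with `isinstance(..., BaseException)` is enough to separate saved files from failures.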

Common constructor arguments:

Parameter	Description
`timeout`	Total HTTP request timeout.
`concurrency`	Internal per-client concurrency for requests and downloads.
`max_retries`	Number of retries after request or download failures.
`retry_backoff`	Base wait time for exponential backoff.
`headers`	Custom HTTP headers.
`cookies` / `cookie_file`	Cookies passed directly, or a cookie file to load and save.
`proxy`	Proxy URL, for example `http://127.0.0.1:7890`.
`thread_workers`	Thread pool size for blocking tasks.
`show_progress`	Whether to show rich progress bars.
`progress_mode`	Progress mode: `auto`, `summary`, `detailed`, or `none`.
`max_detail_tasks`	Maximum number of detailed tasks shown in `auto` mode.
`verbose`	Whether to print logs.
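
To make `max_retries` and `retry_backoff` concrete, here is a minimal sketch of an exponential backoff schedule. It assumes the wait doubles after each failed attempt, starting from `retry_backoff`. The exact formula Paperdl uses, and whether it adds jitter, is not stated in this document, so `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(max_retries, retry_backoff):
    """Return the assumed wait in seconds before each retry: base * 2**attempt."""
    return [retry_backoff * (2 ** attempt) for attempt in range(max_retries)]


if __name__ == "__main__":
    # Four retries with a 0.5 s base.
    print(backoff_delays(4, 0.5))
```

Under this assumption, raising `max_retries` by one roughly doubles the worst-case total wait, which is worth keeping in mind when tuning both values together.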

download(...) calls the client-specific downloaditem(...) internally, so direct calls to downloaditem are rarely needed.

2. ArxivPaperClient

Registered name: arxiv

Import:

from paperdl.modules import ArxivPaperClient

Search parameters

await client.search(
    query,
    total_results=100,
    page_size=50,
    categories=None,
    search_field="all",
    sort_by="submittedDate",
    sort_order="descending",
    raw_query=False,
    deduplicate=True,
)

Parameter	Description
`query`	Search keywords.
`total_results`	Maximum number of returned papers.
`page_size`	Number of results requested per API page.
`categories`	arXiv categories, for example `['cs.CL', 'cs.CV']`.
`search_field`	arXiv search field. Default: `all`.
`sort_by`	`relevance`, `lastUpdatedDate`, or `submittedDate`.
`sort_order`	`ascending` or `descending`.
`raw_query`	If `True`, use the raw arXiv query string without automatic field/category composition.
`deduplicate`	Deduplicate by `PaperInfo.identity_key`.
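
When `raw_query=False`, the client composes the arXiv query from `query`, `search_field`, and `categories`. The sketch below shows one plausible composition using the public arXiv API query syntax (`field:term`, `cat:`, `AND`, `OR`). `compose_arxiv_query` is a hypothetical helper, and the string Paperdl actually sends may differ, for example in how it quotes multi-word queries.

```python
def compose_arxiv_query(query, categories=None, search_field="all"):
    """Build an arXiv API search_query string (assumed composition)."""
    # Quote the keywords so a multi-word query is treated as a phrase.
    parts = [f'{search_field}:"{query}"']
    if categories:
        cats = " OR ".join(f"cat:{c}" for c in categories)
        # Parenthesize so the OR group is evaluated before the AND.
        parts.append(f"({cats})")
    return " AND ".join(parts)


if __name__ == "__main__":
    print(compose_arxiv_query("large language model", ["cs.CL", "cs.AI"]))
```

With `raw_query=True` this step is skipped and the string is passed through as written, which is why the raw example in this section spells out `cat:` and `all:` itself.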

Examples

Search recent arXiv papers in selected categories:

async with ArxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "large language model",
        total_results=20,
        categories=["cs.CL", "cs.AI"],
        sort_by="submittedDate",
    )

Use a raw arXiv query:

papers = await client.search(
    'cat:cs.CL AND all:"retrieval augmented generation"',
    total_results=10,
    raw_query=True,
    sort_by="relevance",
)

Query by arXiv ID:

paper = await client.getbyid("1706.03762")
papers = await client.searchbyids(["1706.03762", "2307.09288"])

Download results:

paths = await client.download(papers[:3], output_dir="papers/arxiv")

Pass arXiv parameters through PaperClient:

from paperdl import PaperClient

async with PaperClient(
    ["arxiv"],
    client_search_kwargs={"arxiv": {"categories": ["cs.CL"], "sort_by": "relevance"}},
) as client:
    papers = await client.search("transformer", total_results=10)

3. OpenReviewPaperClient

Registered name: openreview

Import:

from paperdl.modules import OpenReviewPaperClient

Additional constructor parameters

Parameter	Description
`baseurl`	OpenReview API URL. Default: `https://api2.openreview.net`.
`username` / `password`	Credentials used when login is required.
`api_version`	Default: `2`, using `openreview.api.OpenReviewClient`.

Search parameters

await client.search(
    query=None,
    total_results=100,
    venue_id=None,
    invitation=None,
    content=None,
    details=None,
    accepted_only=False,
    client_side_filter=True,
)

Parameter	Description
`query`	Optional keyword query. By default, filtering is performed locally on title, abstract, authors, keywords, and related fields.
`venue_id`	Venue ID, for example `ICLR.cc/2024/Conference`.
`invitation`	OpenReview invitation, for example `ICLR.cc/2024/Conference/-/Submission`.
`content`	OpenReview content filter, for example `{'venueid': 'ICLR.cc/2024/Conference'}`.
`details`	`details` argument passed to OpenReview `get_all_notes`.
`accepted_only`	When used with `venue_id`, return only accepted papers.
`client_side_filter`	Whether to filter locally by `query`.

At least one of venue_id, invitation, or content must be provided.
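
The scope requirement and the local `query` filter can be sketched as follows. This is an illustration only: `check_scope` and `matches_query` are hypothetical helpers, the records are plain dictionaries rather than OpenReview notes, and the matching rule (every query word must appear, case-insensitively, somewhere in the text fields) is an assumption about how `client_side_filter` might work.

```python
def check_scope(venue_id=None, invitation=None, content=None):
    """Raise ValueError unless at least one scope argument is given."""
    if venue_id is None and invitation is None and not content:
        raise ValueError("provide at least one of venue_id, invitation, or content")


def matches_query(record, query):
    """Return True if every query word appears in the record's text fields."""
    if not query:
        return True  # no query means no local filtering
    haystack = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("authors", [])),
        " ".join(record.get("keywords", [])),
    ]).lower()
    return all(word in haystack for word in query.lower().split())


if __name__ == "__main__":
    check_scope(venue_id="ICLR.cc/2024/Conference")
    record = {"title": "Diffusion Models for Images", "keywords": ["generative"]}
    print(matches_query(record, "diffusion"))
```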

Examples

Search a conference:

async with OpenReviewPaperClient(verbose=False) as client:
    papers = await client.search(
        "diffusion",
        venue_id="ICLR.cc/2024/Conference",
        total_results=20,
    )

Return accepted papers only:

papers = await client.search(
    venue_id="ICLR.cc/2024/Conference",
    accepted_only=True,
    total_results=100,
)

Use an invitation:

papers = await client.search(
    "reasoning",
    invitation="ICLR.cc/2024/Conference/-/Submission",
    total_results=20,
)

Download results:

paths = await client.download(papers[:5], output_dir="papers/openreview")

If the normal PDF URL fails, the client falls back to the OpenReview attachment API.

4. ACLAnthologyPaperClient

Registered name: acl_anthology

Common alias: acl

Import:

from paperdl.modules import ACLAnthologyPaperClient

Search parameters

await client.search(
    query=None,
    total_results=100,
    collection_ids=None,
    max_collections=40,
    deduplicate=True,
)

Parameter	Description
`query`	Optional keyword query. If empty, papers in the scanned collections are returned.
`collection_ids`	ACL Anthology XML collection IDs, for example `['2024.acl-long']`.
`max_collections`	When `collection_ids` is not provided, scan this many recent collections.
`deduplicate`	Deduplicate results.

Examples

Search recent collections:

async with ACLAnthologyPaperClient(verbose=False) as client:
    papers = await client.search(
        "machine translation",
        total_results=20,
        max_collections=30,
    )

Search selected collections:

papers = await client.search(
    "large language model",
    collection_ids=["2024.acl-long", "2024.findings-acl"],
    total_results=20,
)

Download results:

paths = await client.download(papers[:5], output_dir="papers/acl")

5. BioRxivPaperClient and MedRxivPaperClient

Registered names: biorxiv, medrxiv

Import:

from paperdl.modules import BioRxivPaperClient, MedRxivPaperClient

MedRxivPaperClient inherits from BioRxivPaperClient. The main difference is that the source server is medrxiv instead of biorxiv.

Additional constructor parameters

Parameter	Description
`browser_fallback`	Whether to use a Playwright browser fallback when normal download fails.
`browser_headless`	Whether the fallback browser runs in headless mode.
`browser_channel`	Chromium channel, for example `chrome`.
`browser_user_data_dir`	Browser user data directory.
`browser_wait_seconds`	Wait time before retrying when a challenge page is detected.

Search parameters

await client.search(
    query=None,
    total_results=100,
    from_date="2024-01-01",
    to_date=None,
    max_scan_results=5000,
    page_size=100,
    deduplicate=True,
)

Parameter	Description
`query`	Optional keyword query. Matching is performed locally on title, abstract, authors, categories, and related fields.
`from_date` / `to_date`	API scan date range in `YYYY-MM-DD` format. `to_date=None` means today.
`max_scan_results`	Maximum number of source records to scan. This cap keeps searches over very large date ranges from becoming too slow.
`page_size`	Number of records per API request.
`deduplicate`	Deduplicate results.
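
The interaction between the date range, `page_size`, and `max_scan_results` can be pictured as a bounded list of paged requests. The sketch below uses the URL layout of the public bioRxiv `details` endpoint (`/details/{server}/{from}/{to}/{cursor}`). Whether Paperdl builds its requests this way is an assumption, and `scan_urls` is a hypothetical helper.

```python
from datetime import date


def scan_urls(server, from_date, to_date=None, max_scan_results=5000, page_size=100):
    """List the page URLs needed to scan at most max_scan_results records."""
    # fromisoformat raises ValueError for malformed dates.
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(to_date) if to_date else date.today()
    if start > end:
        raise ValueError("from_date must not be after to_date")
    # One request per cursor offset: 0, page_size, 2 * page_size, ...
    return [
        f"https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor}"
        for cursor in range(0, max_scan_results, page_size)
    ]


if __name__ == "__main__":
    for url in scan_urls("biorxiv", "2025-01-01", "2025-01-31", max_scan_results=300):
        print(url)
```

A real scan would normally stop earlier, as soon as a page comes back with fewer records than requested. Because keyword matching happens locally, a narrow date range is the most effective way to keep the scan short.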

Examples

Search and download from bioRxiv:

async with BioRxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "single cell",
        from_date="2025-01-01",
        total_results=20,
    )
    paths = await client.download(papers[:3], output_dir="papers/biorxiv")

Search medRxiv:

async with MedRxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "medical imaging",
        from_date="2025-01-01",
        total_results=20,
    )

Disable browser fallback:

async with BioRxivPaperClient(browser_fallback=False) as client:
    papers = await client.search("protein design", total_results=5)

6. PMLRPaperClient

Registered name: pmlr

Import:

from paperdl.modules import PMLRPaperClient

Search parameters

await client.search(
    query=None,
    total_results=100,
    volume_ids=None,
    max_volumes=None,
    enrich_abstracts=True,
    deduplicate=True,
)

Parameter	Description
`query`	Optional keyword query. Matching uses title, abstract, authors, venue, and related fields.
`volume_ids`	Selected PMLR volumes, for example `[235, 238]`.
`max_volumes`	When `volume_ids` is not provided, scan this many recent volumes.
`enrich_abstracts`	Visit paper detail pages to enrich abstracts, PDF URLs, authors, year, and other metadata.
`deduplicate`	Deduplicate results.

Examples

Search recent volumes:

async with PMLRPaperClient(verbose=False) as client:
    papers = await client.search(
        "diffusion",
        max_volumes=30,
        total_results=20,
    )

Search selected volumes:

papers = await client.search(
    "reinforcement learning",
    volume_ids=[235, 238],
    total_results=20,
)

Run a lightweight search without visiting detail pages:

papers = await client.search(
    "optimization",
    max_volumes=20,
    enrich_abstracts=False,
)

Download results:

paths = await client.download(papers[:5], output_dir="papers/pmlr")

7. PMCOAPaperClient

Registered name: pmc_oa

Common aliases: pmc, pubmed, pubmed_central

Import:

from paperdl.modules import PMCOAPaperClient

Additional constructor parameters

Parameter	Description
`tool`	Tool name passed to NCBI E-utilities.
`email`	Email passed to NCBI E-utilities.
`api_key`	NCBI API key.
`api_delay`	Wait time between paginated API requests.

Search parameters

await client.search(
    query=None,
    total_results=100,
    page_size=50,
    sort="relevance",
    require_pdf=True,
    deduplicate=True,
)

Parameter	Description
`query`	PMC search query. The client automatically applies an open-access filter.
`total_results`	Maximum number of returned papers.
`page_size`	Number of results per API page. Internally capped at 200.
`sort`	NCBI search sort field. Default: `relevance`.
`require_pdf`	Keep only records with an open-access PDF download link.
`deduplicate`	Deduplicate results.
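
As an illustration of how `total_results`, `page_size`, and the cap of 200 fit together, the following sketch plans the paginated requests as `(retstart, retmax)` pairs, the paging parameters used by NCBI E-utilities ESearch. `plan_pages` is a hypothetical helper, and trimming the last page to the remaining count is an assumption about the client's behavior.

```python
def plan_pages(total_results, page_size=50, cap=200):
    """Return (retstart, retmax) pairs that cover total_results records."""
    size = max(1, min(page_size, cap))  # apply the documented cap of 200
    pages = []
    start = 0
    while start < total_results:
        # The last page only asks for what is still missing.
        pages.append((start, min(size, total_results - start)))
        start += size
    return pages


if __name__ == "__main__":
    print(plan_pages(120, page_size=50))
```

With `require_pdf=True` some records are discarded after fetching, so a real client may have to request more pages than this plan suggests in order to return `total_results` papers.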

Examples

Search PMC Open Access:

async with PMCOAPaperClient(email="you@example.com", verbose=False) as client:
    papers = await client.search(
        "cancer immunotherapy",
        total_results=20,
        require_pdf=True,
    )

Use an API key:

async with PMCOAPaperClient(
    email="you@example.com",
    api_key="YOUR_NCBI_API_KEY",
    verbose=False,
) as client:
    papers = await client.search("single cell sequencing", total_results=50)

Download results:

paths = await client.download(papers[:5], output_dir="papers/pmc")

8. Combine Multiple Clients with PaperClient

Use a concrete client when you need fine-grained control over one source. Use PaperClient when you want unified search and download across multiple sources:

import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(
        ["arxiv", "acl_anthology", "pmlr", "pmc_oa"],
        default_init_kwargs={"verbose": False},
        client_init_kwargs={
            "pmc_oa": {"email": "you@example.com"},
        },
        client_search_kwargs={
            "arxiv": {"categories": ["cs.CL"]},
            "acl_anthology": {"max_collections": 20},
            "pmlr": {"max_volumes": 30},
            "pmc_oa": {"require_pdf": True},
        },
        search_concurrency=4,
    ) as client:
        papers = await client.search("large language model", total_results=10)
        paths = await client.download(papers[:10], output_dir="papers")
        print(len(papers), len(paths))

asyncio.run(main())
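
One way to picture how `default_init_kwargs` and `client_init_kwargs` combine is a per-client dictionary merge. The sketch below assumes client-specific values override the shared defaults. That precedence is not stated in this document, and `resolve_kwargs` is a hypothetical helper rather than a Paperdl function.

```python
def resolve_kwargs(names, default_kwargs=None, per_client_kwargs=None):
    """Merge shared defaults with per-client overrides (assumed precedence)."""
    default_kwargs = default_kwargs or {}
    per_client_kwargs = per_client_kwargs or {}
    return {
        # Later keys win, so per-client values replace the defaults.
        name: {**default_kwargs, **per_client_kwargs.get(name, {})}
        for name in names
    }


if __name__ == "__main__":
    merged = resolve_kwargs(
        ["arxiv", "pmc_oa"],
        default_kwargs={"verbose": False},
        per_client_kwargs={"pmc_oa": {"email": "you@example.com"}},
    )
    print(merged)
```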

Return results grouped by source:

results = await client.search(
    "diffusion",
    total_results=10,
    return_by_client=True,
)

for source, papers in results.items():
    print(source, len(papers))
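
Grouped results are convenient when each source should be represented fairly in a combined list. The sketch below interleaves the per-source lists round-robin. It relies only on the dictionary shape shown in the loop above (source name mapped to a list), and `interleave` is a hypothetical helper, not part of Paperdl.

```python
from itertools import zip_longest

_MISSING = object()  # marks positions where a shorter list has run out


def interleave(results, limit=None):
    """Round-robin merge of per-source lists into a single list."""
    merged = []
    for row in zip_longest(*results.values(), fillvalue=_MISSING):
        merged.extend(item for item in row if item is not _MISSING)
    return merged if limit is None else merged[:limit]


if __name__ == "__main__":
    demo = {"arxiv": ["a1", "a2", "a3"], "pmlr": ["p1"]}
    print(interleave(demo))
```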

9. Common PaperInfo Fields

Every search result is a PaperInfo object:

Field	Description
`source`	Source client, for example `ArxivPaperClient`.
`title`	Paper title.
`abstract`	Abstract text.
`authors`	Author list.
`article_url`	Paper detail page.
`download_url`	PDF download URL.
`doi` / `arxiv_id`	DOI or arXiv ID.
`venue` / `publisher`	Conference, journal, or publisher.
`published_at` / `updated_at`	Publication and update timestamps.
`source_id`	Source-specific internal ID.
`keywords` / `categories` / `tags`	Keywords, categories, and tags.
`extra`	Source-specific metadata.

Useful methods and properties:

paper.year
paper.main_url
paper.short_authors
paper.identity_key
paper.filename()
paper.todict()
paper.tojson()
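
Several clients deduplicate by `identity_key`. A minimal sketch of that idea follows, assuming the first paper seen for each key is kept and order is preserved. How Paperdl computes `identity_key` is not described here, so the example uses `SimpleNamespace` stand-ins with made-up keys instead of real `PaperInfo` objects.

```python
from types import SimpleNamespace


def deduplicate(papers):
    """Keep the first paper for each identity_key, preserving input order."""
    seen = set()
    unique = []
    for paper in papers:
        if paper.identity_key not in seen:
            seen.add(paper.identity_key)
            unique.append(paper)
    return unique


if __name__ == "__main__":
    # Stand-in objects; the key values are invented for the demonstration.
    papers = [
        SimpleNamespace(identity_key="key-1", title="First copy"),
        SimpleNamespace(identity_key="key-2", title="Other paper"),
        SimpleNamespace(identity_key="key-1", title="Second copy"),
    ]
    print([p.title for p in deduplicate(papers)])
```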

Restore a paper from a dictionary or JSON string:

from paperdl.modules import PaperInfo

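# `paper` is an existing PaperInfo, for example one search result.
data = paper.todict()
text = paper.tojson()

paper = PaperInfo.fromdict(data)
paper = PaperInfo.fromjson(text)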
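
To keep a whole result list between runs, the dictionaries can be stored one per line in JSON Lines form. The sketch below works on plain dictionaries and uses only the standard library. It assumes the dictionaries produced by `todict()` are JSON-serializable apart from values that `str()` can represent. `dumps_jsonl` and `loads_jsonl` are hypothetical helpers.

```python
import json


def dumps_jsonl(records):
    """Serialize a list of dicts to JSON Lines text, one object per line."""
    # default=str is a guard for values such as datetimes.
    return "".join(
        json.dumps(record, ensure_ascii=False, default=str) + "\n"
        for record in records
    )


def loads_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    records = [{"title": "Paper A", "authors": ["X"]}, {"title": "Paper B", "authors": []}]
    print(loads_jsonl(dumps_jsonl(records)) == records)
```

In use, the records would come from `[p.todict() for p in papers]`, and each loaded dictionary would be passed to `PaperInfo.fromdict`.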