# Paperdl Clients

This document summarizes the built-in Paperdl clients, their search/download arguments, and practical examples. For a short usage-oriented introduction, read `QuickStart.md` first.

## 1. Common Model

All built-in clients inherit from `BasePaperClient` and follow the same basic pattern:

```python
import asyncio
from paperdl.modules import ArxivPaperClient


async def main():
    async with ArxivPaperClient(verbose=False, show_progress=False) as client:
        papers = await client.search("diffusion model", total_results=5)
        paths = await client.download(papers, output_dir="papers")
        print(paths)


asyncio.run(main())
```

Common download API:

```python
paths = await client.download(
    papers,
    output_dir="paperdl_outputs",
    overwrite=False,
    return_exceptions=False,
)
```

Common constructor arguments:

| Parameter | Description |
| --- | --- |
| `timeout` | Total HTTP request timeout. |
| `concurrency` | Internal per-client concurrency for requests and downloads. |
| `max_retries` | Number of retries after request or download failures. |
| `retry_backoff` | Base wait time for exponential backoff. |
| `headers` | Custom HTTP headers. |
| `cookies` / `cookie_file` | Cookies passed directly, or a cookie file to load and save. |
| `proxy` | Proxy URL, for example `http://127.0.0.1:7890`. |
| `thread_workers` | Thread pool size for blocking tasks. |
| `show_progress` | Whether to show rich progress bars. |
| `progress_mode` | Progress mode: `auto`, `summary`, `detailed`, or `none`. |
| `max_detail_tasks` | Maximum number of detailed tasks shown in `auto` mode. |
| `verbose` | Whether to print logs. |

`download(...)` calls the client-specific `downloaditem(...)` internally, so direct calls to `downloaditem` are rarely needed.

## 2. ArxivPaperClient

Registered name: `arxiv`

Import:

```python
from paperdl.modules import ArxivPaperClient
```

#### Search arguments

```python
await client.search(
    query,
    total_results=100,
    page_size=50,
    categories=None,
    search_field="all",
    sort_by="submittedDate",
    sort_order="descending",
    raw_query=False,
    deduplicate=True,
)
```

| Parameter | Description |
| --- | --- |
| `query` | Search keywords. |
| `total_results` | Maximum number of returned papers. |
| `page_size` | Number of results requested per API page. |
| `categories` | arXiv categories, for example `['cs.CL', 'cs.CV']`. |
| `search_field` | arXiv search field. Default: `all`. |
| `sort_by` | `relevance`, `lastUpdatedDate`, or `submittedDate`. |
| `sort_order` | `ascending` or `descending`. |
| `raw_query` | If `True`, use the raw arXiv query string without automatic field/category composition. |
| `deduplicate` | Deduplicate by `PaperInfo.identity_key`. |

#### Examples

Search recent arXiv papers in selected categories:

```python
async with ArxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "large language model",
        total_results=20,
        categories=["cs.CL", "cs.AI"],
        sort_by="submittedDate",
    )
```

Use a raw arXiv query:

```python
papers = await client.search(
    'cat:cs.CL AND all:"retrieval augmented generation"',
    total_results=10,
    raw_query=True,
    sort_by="relevance",
)
```

Query by arXiv ID:

```python
paper = await client.getbyid("1706.03762")
papers = await client.searchbyids(["1706.03762", "2307.09288"])
```

Download results:

```python
paths = await client.download(papers[:3], output_dir="papers/arxiv")
```

Pass arXiv parameters through `PaperClient`:

```python
from paperdl import PaperClient

async with PaperClient(
    ["arxiv"],
    client_search_kwargs={"arxiv": {"categories": ["cs.CL"], "sort_by": "relevance"}},
) as client:
    papers = await client.search("transformer", total_results=10)
```

## 3. OpenReviewPaperClient

Registered name: `openreview`

Import:

```python
from paperdl.modules import OpenReviewPaperClient
```

#### Additional constructor parameters

| Parameter | Description |
| --- | --- |
| `baseurl` | OpenReview API URL. Default: `https://api2.openreview.net`. |
| `username` / `password` | Credentials used when login is required. |
| `api_version` | Default: `2`, using `openreview.api.OpenReviewClient`. |

#### Search parameters

```python
await client.search(
    query=None,
    total_results=100,
    venue_id=None,
    invitation=None,
    content=None,
    details=None,
    accepted_only=False,
    client_side_filter=True,
)
```

| Parameter | Description |
| --- | --- |
| `query` | Optional keyword query. By default, filtering is performed locally on title, abstract, authors, keywords, and related fields. |
| `venue_id` | Venue ID, for example `ICLR.cc/2024/Conference`. |
| `invitation` | OpenReview invitation, for example `ICLR.cc/2024/Conference/-/Submission`. |
| `content` | OpenReview content filter, for example `{'venueid': 'ICLR.cc/2024/Conference'}`. |
| `details` | `details` argument passed to OpenReview `get_all_notes`. |
| `accepted_only` | When used with `venue_id`, return only accepted papers. |
| `client_side_filter` | Whether to filter locally by `query`. |

At least one of `venue_id`, `invitation`, or `content` must be provided.

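
The local filtering that `client_side_filter` enables can be pictured with a small, self-contained sketch. It is not Paperdl's implementation: the record layout (plain dictionaries), the names `matches_query` and `filter_records`, and the rule that every whitespace-separated query term must appear somewhere in the searched fields are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any, Iterable

# Fields searched locally (assumed; based on the fields named in the table above).
SEARCH_FIELDS = ("title", "abstract", "authors", "keywords")


def _field_text(value: Any) -> str:
    """Flatten a string or a collection of strings into one lowercase string."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple, set)):
        return " ".join(str(item) for item in value).lower()
    return str(value).lower()


def matches_query(record: dict, query: str | None) -> bool:
    """Return True if every query term occurs in at least one searched field."""
    if not query or not query.strip():
        return True  # an empty query keeps every record
    haystack = " ".join(_field_text(record.get(name)) for name in SEARCH_FIELDS)
    return all(term in haystack for term in query.lower().split())


def filter_records(records: Iterable[dict], query: str | None, limit: int) -> list[dict]:
    """Keep at most `limit` matching records, preserving their original order."""
    if limit <= 0:
        return []
    kept: list[dict] = []
    for record in records:
        if matches_query(record, query):
            kept.append(record)
            if len(kept) >= limit:
                break  # stop scanning once enough results are collected
    return kept
```

Here `limit` plays the role that `total_results` plays in `search(...)`. A stricter or looser matching rule (phrase matching, any-term matching) would slot into `matches_query` without changing the rest.
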
#### Examples

Search a conference:

```python
async with OpenReviewPaperClient(verbose=False) as client:
    papers = await client.search(
        "diffusion",
        venue_id="ICLR.cc/2024/Conference",
        total_results=20,
    )
```

Return accepted papers only:

```python
papers = await client.search(
    venue_id="ICLR.cc/2024/Conference",
    accepted_only=True,
    total_results=100,
)
```

Use an invitation:

```python
papers = await client.search(
    "reasoning",
    invitation="ICLR.cc/2024/Conference/-/Submission",
    total_results=20,
)
```

Download results:

```python
paths = await client.download(papers[:5], output_dir="papers/openreview")
```

If the normal PDF URL fails, the client falls back to the OpenReview attachment API.

## 4. ACLAnthologyPaperClient

Registered name: `acl_anthology`

Common aliases: `acl`

Import:

```python
from paperdl.modules import ACLAnthologyPaperClient
```

#### Search parameters

```python
await client.search(
    query=None,
    total_results=100,
    collection_ids=None,
    max_collections=40,
    deduplicate=True,
)
```

| Parameter | Description |
| --- | --- |
| `query` | Optional keyword query. If empty, papers in the scanned collections are returned. |
| `collection_ids` | ACL Anthology XML collection IDs, for example `['2024.acl-long']`. |
| `max_collections` | When `collection_ids` is not provided, scan this many recent collections. |
| `deduplicate` | Deduplicate results. |

#### Examples

Search recent collections:

```python
async with ACLAnthologyPaperClient(verbose=False) as client:
    papers = await client.search(
        "machine translation",
        total_results=20,
        max_collections=30,
    )
```

Search selected collections:

```python
papers = await client.search(
    "large language model",
    collection_ids=["2024.acl-long", "2024.findings-acl"],
    total_results=20,
)
```

Download results:

```python
paths = await client.download(papers[:5], output_dir="papers/acl")
```

## 5. BioRxivPaperClient and MedRxivPaperClient

Registered names: `biorxiv`, `medrxiv`

Import:

```python
from paperdl.modules import BioRxivPaperClient, MedRxivPaperClient
```

`MedRxivPaperClient` inherits from `BioRxivPaperClient`. The main difference is that the source server is `medrxiv` instead of `biorxiv`.

#### Additional constructor parameters

| Parameter | Description |
| --- | --- |
| `browser_fallback` | Whether to use a Playwright browser fallback when normal download fails. |
| `browser_headless` | Whether the fallback browser runs in headless mode. |
| `browser_channel` | Chromium channel, for example `chrome`. |
| `browser_user_data_dir` | Browser user data directory. |
| `browser_wait_seconds` | Wait time before retrying when a challenge page is detected. |

#### Search parameters

```python
await client.search(
    query=None,
    total_results=100,
    from_date="2024-01-01",
    to_date=None,
    max_scan_results=5000,
    page_size=100,
    deduplicate=True,
)
```

| Parameter | Description |
| --- | --- |
| `query` | Optional keyword query. Matching is performed locally on title, abstract, authors, categories, and related fields. |
| `from_date` / `to_date` | API scan date range in `YYYY-MM-DD` format. `to_date=None` means today. |
| `max_scan_results` | Maximum number of source records to scan, which prevents very large date ranges from becoming too slow. |
| `page_size` | Number of records per API request. |
| `deduplicate` | Deduplicate results. |

#### Examples

Search and download from bioRxiv:

```python
async with BioRxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "single cell",
        from_date="2025-01-01",
        total_results=20,
    )
    paths = await client.download(papers[:3], output_dir="papers/biorxiv")
```

Search medRxiv:

```python
async with MedRxivPaperClient(verbose=False) as client:
    papers = await client.search(
        "medical imaging",
        from_date="2025-01-01",
        total_results=20,
    )
```

Disable browser fallback:

```python
async with BioRxivPaperClient(browser_fallback=False) as client:
    papers = await client.search("protein design", total_results=5)
```

## 6. PMLRPaperClient

Registered name: `pmlr`

Import:

```python
from paperdl.modules import PMLRPaperClient
```

#### Search parameters

```python
await client.search(
    query=None,
    total_results=100,
    volume_ids=None,
    max_volumes=None,
    enrich_abstracts=True,
    deduplicate=True,
)
```

| Parameter | Description |
| --- | --- |
| `query` | Optional keyword query. Matching uses title, abstract, authors, venue, and related fields. |
| `volume_ids` | Selected PMLR volumes, for example `[235, 238]`. |
| `max_volumes` | When `volume_ids` is not provided, scan this many recent volumes. |
| `enrich_abstracts` | Visit paper detail pages to enrich abstracts, PDF URLs, authors, year, and other metadata. |
| `deduplicate` | Deduplicate results. |

#### Examples

Search recent volumes:

```python
async with PMLRPaperClient(verbose=False) as client:
    papers = await client.search(
        "diffusion",
        max_volumes=30,
        total_results=20,
    )
```

Search selected volumes:

```python
papers = await client.search(
    "reinforcement learning",
    volume_ids=[235, 238],
    total_results=20,
)
```

Run a lightweight search without visiting detail pages:

```python
papers = await client.search(
    "optimization",
    max_volumes=20,
    enrich_abstracts=False,
)
```

Download results:

```python
paths = await client.download(papers[:5], output_dir="papers/pmlr")
```

## 7. PMCOAPaperClient

Registered name: `pmc_oa`

Common aliases: `pmc`, `pubmed`, `pubmed_central`

Import:

```python
from paperdl.modules import PMCOAPaperClient
```

#### Additional constructor parameters

| Parameter | Description |
| --- | --- |
| `tool` | Tool name passed to NCBI E-utilities. |
| `email` | Email passed to NCBI E-utilities. |
| `api_key` | NCBI API key. |
| `api_delay` | Wait time between paginated API requests. |

#### Search parameters

```python
await client.search(
    query=None,
    total_results=100,
    page_size=50,
    sort="relevance",
    require_pdf=True,
    deduplicate=True,
)
```

| Parameter | Description |
| --- | --- |
| `query` | PMC search query. The client automatically applies an open-access filter. |
| `total_results` | Maximum number of returned papers. |
| `page_size` | Number of results per API page. Internally capped at 200. |
| `sort` | NCBI search sort field. Default: `relevance`. |
| `require_pdf` | Keep only records with an open-access PDF download link. |
| `deduplicate` | Deduplicate results. |

#### Examples

Search PMC Open Access:

```python
async with PMCOAPaperClient(email="you@example.com", verbose=False) as client:
    papers = await client.search(
        "cancer immunotherapy",
        total_results=20,
        require_pdf=True,
    )
```

Use an API key:

```python
async with PMCOAPaperClient(
    email="you@example.com",
    api_key="YOUR_NCBI_API_KEY",
    verbose=False,
) as client:
    papers = await client.search("single cell sequencing", total_results=50)
```

Download results:

```python
paths = await client.download(papers[:5], output_dir="papers/pmc")
```

## 8. Combine Multiple Clients with PaperClient

Use a concrete client when you need fine-grained control over one source.
Use `PaperClient` when you want unified search and download across multiple sources:

```python
import asyncio
from paperdl import PaperClient


async def main():
    async with PaperClient(
        ["arxiv", "acl_anthology", "pmlr", "pmc_oa"],
        default_init_kwargs={"verbose": False},
        client_init_kwargs={
            "pmc_oa": {"email": "you@example.com"},
        },
        client_search_kwargs={
            "arxiv": {"categories": ["cs.CL"]},
            "acl_anthology": {"max_collections": 20},
            "pmlr": {"max_volumes": 30},
            "pmc_oa": {"require_pdf": True},
        },
        search_concurrency=4,
    ) as client:
        papers = await client.search("large language model", total_results=10)
        paths = await client.download(papers[:10], output_dir="papers")
        print(len(papers), len(paths))


asyncio.run(main())
```

Return results grouped by source:

```python
results = await client.search(
    "diffusion",
    total_results=10,
    return_by_client=True,
)
for source, papers in results.items():
    print(source, len(papers))
```

## 9. Common PaperInfo Fields

Every search result is a `PaperInfo` object:

| Field | Description |
| --- | --- |
| `source` | Source client, for example `ArxivPaperClient`. |
| `title` | Paper title. |
| `abstract` | Abstract text. |
| `authors` | Author list. |
| `article_url` | Paper detail page. |
| `download_url` | PDF download URL. |
| `doi` / `arxiv_id` | DOI or arXiv ID. |
| `venue` / `publisher` | Conference, journal, or publisher. |
| `published_at` / `updated_at` | Publication and update timestamps. |
| `source_id` | Source-specific internal ID. |
| `keywords` / `categories` / `tags` | Keywords, categories, and tags. |
| `extra` | Source-specific metadata. |

Useful methods and properties:

```python
paper.year
paper.main_url
paper.short_authors
paper.identity_key
paper.filename()
paper.todict()
paper.tojson()
```

Restore a paper from a dictionary or JSON string:

```python
from paperdl.modules import PaperInfo

paper = PaperInfo.fromdict(data)
paper = PaperInfo.fromjson(text)
```
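
When search results need to survive between runs, one option is to store the dictionaries produced by `paper.todict()` as JSON Lines and rebuild the objects later with `PaperInfo.fromdict`. The sketch below works on plain dictionaries so that it runs without Paperdl installed. The helper names `save_records` and `load_records` are hypothetical, and the sketch assumes the dictionaries are JSON-serializable, which this document does not state.

```python
import json
from pathlib import Path


def save_records(records, path):
    """Write one JSON object per line and return the number of records written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII titles and names readable on disk
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def load_records(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

With real results, `save_records((p.todict() for p in papers), "papers/index.jsonl")` followed by `[PaperInfo.fromdict(d) for d in load_records("papers/index.jsonl")]` would complete the round trip, under the same serializability assumption.
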