Quick Start
Paperdl is a unified asynchronous toolkit for scholarly paper search and PDF download. It can be used in two main ways:
- Command line: powered by PaperClientCMD, suitable for quick searches, saved results, and batch downloads.
- Python package: powered by PaperClient, suitable for scripts, scheduled jobs, and research workflows.
Built-in client names: arxiv, openreview, acl_anthology, biorxiv, medrxiv, pmlr, and pmc_oa. The default client is arxiv.
Command Line Usage
The examples below use the paperdl command. If your development environment has not registered the console script, replace paperdl with:
python -m paperdl.paperdl
(1) List Available Clients
paperdl clients
(2) Search Papers
Search the default arXiv source:
paperdl search "diffusion model" -n 10
Search multiple sources:
paperdl search "large language model" -c arxiv,pmlr,acl_anthology -n 5
Search all registered sources:
paperdl search "retrieval augmented generation" -c all -n 3 \
--client-search-param openreview.venue_id=ICLR.cc/2024/Conference \
--client-search-param biorxiv.max_scan_results=500 \
--client-search-param medrxiv.max_scan_results=500 \
--client-search-param pmlr.max_volumes=120
When using -c all, client-specific search parameters may be required. For example, OpenReview needs a search scope such as venue_id, while clients such as bioRxiv, medRxiv, and PMLR may need scan limits to keep the search fast.
Print JSON or JSONL:
paperdl search "transformer" -c arxiv -n 5 --format json
paperdl search "transformer" -c arxiv -n 5 --format jsonl
Save search results for later download:
paperdl search "graph neural network" -c arxiv,pmlr -n 10 --output-json outputs/search_results.json
Show only the first few rows in the terminal while saving all results:
paperdl search "multimodal large language model" -c arxiv -n 50 --limit 10 --output-json outputs/mllm.json
Pass common search parameters to every selected client:
paperdl search "large language model" -c arxiv -n 20 --search-param sort_by=relevance --search-param page_size=20
Pass per-client search parameters. For macOS/Linux/Git Bash:
paperdl search "diffusion" -c arxiv,pmlr -n 3 \
--client-search-param 'arxiv.categories=["cs.CV","cs.LG"]' \
--client-search-param pmlr.max_volumes=30
For Windows cmd:
paperdl search "diffusion" -c arxiv,pmlr -n 3 ^
--client-search-param "arxiv.categories=[\"cs.CV\",\"cs.LG\"]" ^
--client-search-param pmlr.max_volumes=30
You can also pass per-client parameters as JSON. For macOS/Linux/Git Bash:
paperdl search "diffusion" -c arxiv,pmlr -n 3 \
--client-search-kwargs '{"arxiv":{"categories":["cs.CV","cs.LG"]},"pmlr":{"max_volumes":30}}'
For Windows cmd:
paperdl search "diffusion" -c arxiv,pmlr -n 3 ^
--client-search-kwargs "{\"arxiv\":{\"categories\":[\"cs.CV\",\"cs.LG\"]},\"pmlr\":{\"max_volumes\":30}}"
On Windows cmd, do not use single quotes around JSON-like values. Use double quotes around the whole argument and escape inner double quotes with \".
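The repeated --client-search-param form and the single --client-search-kwargs JSON form describe the same nested mapping from client name to search options. The sketch below illustrates that equivalence. It is not paperdl's actual parser: the helper name parse_client_params and the rule that values are decoded as JSON when possible (and kept as plain strings otherwise) are assumptions made for illustration.

```python
import json


def parse_client_params(pairs):
    """Turn ["client.key=value", ...] into {client: {key: value}}.

    Hypothetical helper: values that parse as JSON (numbers, lists, booleans)
    are decoded; anything else is kept as a plain string.
    """
    parsed = {}
    for pair in pairs:
        target, sep, raw_value = pair.partition("=")
        client, dot, key = target.partition(".")
        if not sep or not dot or not client or not key:
            raise ValueError(f"expected client.key=value, got {pair!r}")
        try:
            value = json.loads(raw_value)
        except json.JSONDecodeError:
            value = raw_value  # e.g. ICLR.cc/2024/Conference stays a string
        parsed.setdefault(client, {})[key] = value
    return parsed


params = parse_client_params([
    'arxiv.categories=["cs.CV","cs.LG"]',
    "pmlr.max_volumes=30",
    "openreview.venue_id=ICLR.cc/2024/Conference",
])
print(json.dumps(params, indent=2))
```

Printing the result as JSON shows the shape that the --client-search-kwargs examples above write by hand.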
(3) Download Papers
Search and download all returned papers:
paperdl download "diffusion model" -c arxiv -n 5 -o papers
Download only the first three results:
paperdl download "diffusion model" -c arxiv -n 20 --select top3 -o papers
Download selected result indices shown in the preview table:
paperdl download "diffusion model" -c arxiv -n 20 --select 1,3-5 -o papers
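To make the --select forms concrete, here is a minimal sketch of how a selection string could map onto positions in the result list. It is an illustration, not paperdl's implementation: the helper name resolve_selection is hypothetical, and it assumes that preview indices start at 1 and that ranges such as 3-5 are inclusive.

```python
def resolve_selection(spec, total):
    """Map a selection string to zero-based positions in a result list.

    Hypothetical helper. Accepts "all", "topN", or comma-separated
    1-based indices and inclusive ranges such as "1,3-5".
    """
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(total))
    if spec.startswith("top"):
        return list(range(min(int(spec[3:]), total)))
    chosen = []
    for part in spec.split(","):
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        for index in range(first, last + 1):
            position = index - 1  # preview indices are assumed to start at 1
            if 0 <= position < total and position not in chosen:
                chosen.append(position)
    return chosen


print(resolve_selection("top3", 20))   # [0, 1, 2]
print(resolve_selection("1,3-5", 20))  # [0, 2, 3, 4]
```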
Download from a saved search result file:
paperdl download --input-json outputs/search_results.json --select top10 -o papers
Overwrite existing PDF files:
paperdl download "attention is all you need" -c arxiv -n 1 -o papers --overwrite
Run in quiet mode:
paperdl download "diffusion" -c arxiv,pmlr -n 5 --quiet -o papers
Stop immediately when any selected client fails:
paperdl download "diffusion" -c arxiv,pmlr -n 5 --raise-on-error
(4) Common CLI Options
| Option | Purpose |
|---|---|
| -c, --clients | Comma-separated client names, or all. Default: arxiv. |
| -n, --total-results | Default number of results per client. |
| --output-json | Save search results to a JSON file. |
| --input-json | Load paper records from JSON when running download. |
| --format | Output format: table, json, or jsonl. |
| --select | Download selection, such as all, top10, or 1,3-5. |
| -o, --output-dir | Output directory for PDFs. |
| --overwrite | Overwrite existing PDF files. |
| --no-dedupe | Disable cross-client deduplication. |
| --quiet | Disable verbose logs and progress output where possible. |
| --search-concurrency | Number of clients searched concurrently. |
| --init-param / --search-param | Constructor or search parameter applied to all clients. |
| --client-init-param / --client-search-param | Constructor or search parameter applied to one client. |
| --init-kwargs / --search-kwargs | JSON object applied to all clients. |
| --client-init-kwargs / --client-search-kwargs | JSON object keyed by client name. |
Python Package Usage
(1) Minimal Search Example
import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(["arxiv"], default_init_kwargs={"verbose": False}) as client:
        papers = await client.search("diffusion model", total_results=5)
        for paper in papers:
            print(paper.title, paper.article_url, paper.download_url)

asyncio.run(main())
client.search(...) returns a list of PaperInfo objects. Common fields include title, abstract, authors, article_url, download_url, doi, arxiv_id, venue, published_at, and source.
(2) Search Multiple Sources
import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(["arxiv", "pmlr", "acl_anthology"]) as client:
        papers = await client.search("large language model", total_results=5)
        print(f"found {len(papers)} papers")

asyncio.run(main())
Return results grouped by client:
results = await client.search(
    "large language model",
    total_results=5,
    return_by_client=True,
)
print(results["arxiv"])
print(results["pmlr"])
(3) Search and Download
import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(["arxiv"]) as client:
        papers = await client.search("attention is all you need", total_results=1)
        paths = await client.download(papers, output_dir="papers")
        print(paths)

asyncio.run(main())
Run search and download in one call:
papers, paths = await client.searchanddownload(
    "diffusion model",
    clients=["arxiv"],
    total_results=5,
    output_dir="papers",
)
(4) Save and Load Search Results
from paperdl import PaperClient
PaperClient.saveresults(papers, "outputs/search_results.json")
loaded_papers = PaperClient.loadresults("outputs/search_results.json")
Loaded results can be downloaded later:
async with PaperClient(["arxiv", "pmlr", "acl_anthology"]) as client:
    loaded_papers = PaperClient.loadresults("outputs/search_results.json")
    paths = await client.download(loaded_papers, output_dir="papers")
(5) Configure Different Clients Differently
import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(
        ["arxiv", "pmlr"],
        default_init_kwargs={"verbose": False, "show_progress": False},
        client_search_kwargs={
            "arxiv": {"categories": ["cs.CL", "cs.AI"], "sort_by": "submittedDate"},
            "pmlr": {"max_volumes": 30, "enrich_abstracts": True},
        },
        search_concurrency=2,
    ) as client:
        papers = await client.search("large language model", total_results=10)
        await client.download(papers[:5], output_dir="papers")

asyncio.run(main())
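The search_concurrency setting caps how many clients are searched at the same time. The sketch below shows one common way to get that behavior, using asyncio.Semaphore. It is an illustration with stand-in coroutines, not a description of paperdl's internals: fake_search and search_with_limit are hypothetical names, and the sleep simulates network latency.

```python
import asyncio


async def fake_search(name, delay, limiter, running, peak):
    """Stand-in for one client's search; records how many run at once."""
    async with limiter:
        running.append(name)
        peak.append(len(running))
        await asyncio.sleep(delay)  # simulated network latency
        running.remove(name)
        return f"{name}: done"


async def search_with_limit(names, search_concurrency):
    """Run one fake search per client, at most search_concurrency at a time."""
    limiter = asyncio.Semaphore(search_concurrency)
    running, peak = [], []
    results = await asyncio.gather(
        *(fake_search(name, 0.01, limiter, running, peak) for name in names)
    )
    return results, max(peak)


results, max_in_flight = asyncio.run(
    search_with_limit(["arxiv", "pmlr", "acl_anthology"], search_concurrency=2)
)
print(results)
print("most searches in flight at once:", max_in_flight)
```

With three clients and a limit of two, the third search only starts once one of the first two has finished.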
Override search parameters for a single call:
papers = await client.search(
    "diffusion",
    total_results=10,
    client_search_kwargs={
        "arxiv": {"categories": ["cs.CV"]},
        "pmlr": {"max_volumes": 20},
    },
)
(6) Error Handling
By default, a failed client does not stop other clients. Search errors are stored in client.last_errors.
papers = await client.search("diffusion", total_results=5)
if client.last_errors:
    for name, err in client.last_errors.items():
        print(name, err)
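The default behavior described above (one failing client does not stop the others, and errors are collected by client name) can be sketched with asyncio.gather and return_exceptions=True. This is a minimal illustration, not paperdl's actual code: search_one and search_tolerantly are hypothetical stand-ins, and the pmlr failure is simulated.

```python
import asyncio


async def search_one(name):
    """Stand-in for a per-client search; "pmlr" fails on purpose."""
    await asyncio.sleep(0)
    if name == "pmlr":
        raise RuntimeError("simulated timeout")
    return [f"{name} paper"]


async def search_tolerantly(names):
    """Collect results from every client and keep failures by client name."""
    outcomes = await asyncio.gather(
        *(search_one(name) for name in names), return_exceptions=True
    )
    papers, last_errors = [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            last_errors[name] = outcome  # record the failure and keep going
        else:
            papers.extend(outcome)
    return papers, last_errors


papers, last_errors = asyncio.run(search_tolerantly(["arxiv", "pmlr"]))
print(papers)
print({name: str(err) for name, err in last_errors.items()})
```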
Raise immediately on failure:
papers = await client.search("diffusion", total_results=5, raise_on_error=True)
paths = await client.download(papers, output_dir="papers", raise_on_error=True)
Keep download exceptions in the returned list:
results = await client.download(
    papers,
    output_dir="papers",
    return_exceptions=True,
)
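With return_exceptions=True, the returned list can mix successful results and exceptions. The sketch below separates the two. It assumes, which this file does not confirm, that successes are file paths and that failures are exception instances at the same position as their input paper. The helper name split_download_results is hypothetical and the sample list is made up.

```python
from pathlib import Path


def split_download_results(results):
    """Separate saved paths from exceptions in a mixed result list."""
    saved, failed = [], []
    for position, item in enumerate(results):
        if isinstance(item, BaseException):
            failed.append((position, item))
        else:
            saved.append(Path(item))
    return saved, failed


# Made-up stand-in for the list returned by client.download(...).
results = ["papers/a.pdf", ConnectionError("simulated failure"), "papers/c.pdf"]
saved, failed = split_download_results(results)
print([path.name for path in saved])
print([(position, type(error).__name__) for position, error in failed])
```

Keeping the position alongside each exception makes it possible to retry only the papers that failed.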
(7) Use PaperClientCMD from Python
PaperClientCMD is the Python wrapper behind the command line interface. It is useful when you want to reuse CLI behavior inside another script:
from paperdl import PaperClientCMD
PaperClientCMD(["clients"]).run()
PaperClientCMD(["search", "diffusion model", "-c", "arxiv", "-n", "5"]).run()
PaperClientCMD(["download", "diffusion model", "-c", "arxiv", "-n", "3", "-o", "papers"]).run()
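Because PaperClientCMD takes a list of arguments rather than a shell command line, the shell quoting rules from the command line section should not apply: each list element is passed as-is. The sketch below builds such a list from Python values, using only flags documented above. The helper build_search_argv is hypothetical, and the final PaperClientCMD call is left as a comment so the block runs without paperdl installed.

```python
import json


def build_search_argv(query, clients, total_results, client_search_kwargs=None):
    """Assemble a `search` argument list from Python values."""
    argv = ["search", query, "-c", ",".join(clients), "-n", str(total_results)]
    if client_search_kwargs:
        # json.dumps produces the JSON text that the shell examples quote by hand.
        argv += ["--client-search-kwargs", json.dumps(client_search_kwargs)]
    return argv


argv = build_search_argv(
    "diffusion",
    ["arxiv", "pmlr"],
    3,
    {"arxiv": {"categories": ["cs.CV", "cs.LG"]}, "pmlr": {"max_volumes": 30}},
)
print(argv)
# With paperdl installed, the list could then be handed over:
# PaperClientCMD(argv).run()
```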
Next Steps
- For quick usage, start with the CLI and PaperClient examples in this file.
- For source-specific search options, see Clients.md.
- To add a new paper source, subclass BasePaperClient, implement searchanddownloaditem, and register it in the client registry.