Elicit AI Review 2026: 4 Accuracy Tests

I approached this 2026 review of Elicit AI as a test of one clear proposition: can a specialised research assistant reduce the mechanical labour of evidence synthesis without weakening the audit trail that makes a systematic review trustworthy? This article examines Elicit’s search, title and abstract screening, full-text screening, structured extraction, reports, library, alerts, exports, API, integrations, pricing and operational limits. It also explains exactly what the widely quoted 95%, 97%, 99% and 96% figures measure, where they do not generalise, and which researchers gain enough time to justify a paid plan.

The answer is more nuanced than the marketing shorthand. Elicit is one of the strongest purpose-built tools available for systematic literature reviews, especially when a team needs reproducible screening decisions, source-linked extraction tables and PRISMA-oriented outputs. Its official May 2026 evaluation used 994 open-access Cochrane reviews and reported 95.0% search recall, 96.9% abstract-screening sensitivity, 99.5% full-text paper-level sensitivity and 95.6% correctness on selected extraction tasks. Those are impressive results, but they are not interchangeable measures of overall accuracy. Full-text specificity was 70.1%, and the extraction benchmark ultimately relied on 198 studies whose PDFs were accessible.

During this 2026 evaluation, I reviewed Elicit’s public product pages, pricing states, API specification, validation methods, recent product announcements and external evidence-synthesis guidance. I did not run an authenticated, large-scale review inside a paid workspace, so interface observations are limited to publicly reproducible documentation. The practical verdict is still clear: Elicit can replace a large share of repetitive discovery, screening and extraction work, but it does not replace protocol design, domain judgement, risk-of-bias assessment, statistical synthesis, reference management or accountable human review.

What Elicit AI Is in 2026

Elicit is a scientific research platform built around structured evidence workflows rather than open-ended chat. Its current index covers more than 138 million academic papers, with a separate clinical-trials search covering more than 500,000 records. A researcher can start with a natural-language question, use semantic or keyword search, filter papers, screen them against eligibility criteria, extract fields into a comparison table, generate a report and preserve links back to the supporting sentences, tables or figures.

That workflow is the reason Elicit belongs in a different category from general assistants. ChatGPT, Claude and Gemini can reason over documents and help draft prose, but they do not inherently enforce a review protocol, maintain a screening ledger or produce a PRISMA-ready record of exclusions. Elicit’s closer competitors are evidence-synthesis and academic-search products such as Consensus, Rayyan, Covidence, Nested Knowledge and specialist research workspaces. Our broader ranking of the best AI research tools helps place those categories side by side.

The product has also changed enough that several common criticisms are now outdated. Elicit has a Library for storing and organising sources, can import from Zotero, exports RIS, CSV, BIB, PDF and DOCX on eligible plans, and offers paper alerts. It is therefore inaccurate to describe the 2026 product as having no persistent library at all. The more precise limitation is that Elicit is not a full reference manager: it does not replace Zotero’s mature citation styles, word-processor plugins, group-library conventions or long-term bibliographic governance.

Elicit is used most naturally in health, life sciences, policy research and other fields with structured study designs. It can search broadly across disciplines, but its strongest validation evidence is still medical and Cochrane-centred. Researchers in law, history, humanities or internal corporate-document analysis should not assume that a benchmark built on intervention reviews transfers unchanged to archival sources, case law, qualitative corpora or unpublished files.

Elicit AI Review 2026 Verdict

Elicit is excellent at turning a research question and a paper set into an auditable screening-and-extraction workspace, but it remains a component of a research stack rather than a complete scholarly operating system.

Elicit AI Review 2026: Who Should Pay for It?

The strongest fit is a researcher who repeatedly faces the same expensive bottleneck: hundreds or thousands of candidate papers, explicit inclusion criteria, a defined extraction schema and a requirement to show how each judgement was reached. Graduate students, systematic-review teams, health-technology assessment groups, pharmaceutical evidence teams and policy analysts can all benefit because Elicit converts unstructured papers into comparable rows and columns while retaining source provenance.

For a student writing a conventional narrative literature review, the free tier may be enough to map a topic, summarise papers and chat with accessible full texts. A doctoral researcher planning a formal review should budget for Pro or institutional access because the value lies in the dedicated workflow, higher paper limits, custom extraction and exports. The practical division of labour also matters: our guide to AI tools for students treats research discovery, writing, citation management and study support as separate jobs rather than pretending one subscription handles all of them.

Elicit is less compelling for someone who only needs a quick answer from a few papers, already has a stable Covidence or Rayyan workflow, or mainly wants to draft polished academic prose. It can generate research reports, but its core advantage is evidence handling, not voice, argument development or publication-ready writing. A user who expects a one-click thesis may find the product disciplined but incomplete.

The economic case depends on review frequency. Pro’s annual quota of 144 reports or systematic reviews is generous for an individual researcher, while Scale adds collaboration, figure interpretation, higher report and column limits, and usage administration. Enterprise is where the product becomes a high-volume evidence pipeline, with screening of up to 40,000 papers, custom data sources, stronger security controls and unlimited API access. The paid decision should therefore be based on the number of formal workflows, not on the number of casual searches, because Basic already offers unlimited search across the paper index.

My bottom line is that Elicit is worth paying for when missed studies, undocumented exclusions and manual extraction hours are material risks. It is harder to justify when a user needs a writing companion first and a review engine second.

Best-fit users

Systematic reviewers, evidence-synthesis teams, PhD candidates, clinical and policy researchers, and R&D groups with repeatable extraction schemas.

Poor-fit users

Researchers seeking full manuscript generation, legal-document review, comprehensive citation management or a substitute for specialist databases and librarians.

How the Systematic Review Workflow Works

A rigorous Elicit workflow begins before the platform. Define the review question, protocol, populations, interventions or exposures, comparators, outcomes, study designs, date limits, languages and exclusion rules. Elicit can help operationalise those choices, but it cannot decide which methodological commitments are defensible for a discipline. A vague question produces a vague search and an unstable extraction table.

Step 1 is source gathering. Use semantic search for conceptually related papers, keyword search for reproducible Boolean logic, PubMed restriction when appropriate, and imported records from databases or Zotero. Elicit’s 2026 API also supports a dedicated clinical-trials endpoint. For formal reviews, run more than one strategy and record each query, date and corpus. A useful companion is our doctoral research workflow guide, which explains why AI discovery should sit alongside library databases rather than replace them.
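
The habit of recording each query, date and corpus can be sketched as an append-only search log. This is a minimal illustration rather than an Elicit feature: the `SearchRecord` fields, the `to_jsonl` helper and the two example queries are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class SearchRecord:
    """One executed search strategy (hypothetical schema for illustration)."""
    query: str
    mode: str              # e.g. "semantic" or "keyword"
    corpus: str            # e.g. "Elicit", "PubMed", "Zotero import"
    run_on: date
    results_returned: int


def to_jsonl(records):
    """Serialise records as JSON Lines so the log can be appended and diffed."""
    lines = []
    for record in records:
        row = asdict(record)
        row["run_on"] = record.run_on.isoformat()   # dates are not JSON-native
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)


# Invented example entries, not real search results.
log = [
    SearchRecord("exercise therapy for chronic low back pain", "semantic",
                 "Elicit", date(2026, 5, 4), 412),
    SearchRecord('"low back pain" AND exercise AND random*', "keyword",
                 "PubMed", date(2026, 5, 4), 1280),
]
print(to_jsonl(log))
```

One line per executed strategy keeps the log easy to attach to a protocol appendix and easy to compare between reruns.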

Step 2 is title and abstract screening. Enter explicit criteria and inspect the model’s include, exclude or maybe decisions. Elicit treats maybe as screen-positive because false negatives are more damaging at this stage. Human reviewers should sample clear inclusions, clear exclusions and uncertain cases, then refine ambiguous criteria before scaling to the full set.
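
The calibration step above can be sketched as a seeded, stratified audit sample. This is an assumption-laden illustration: the function names, the paper IDs and the mix of decisions are hypothetical, and the only behaviour taken from the text is that maybe counts as screen-positive.

```python
import random
from collections import defaultdict


def audit_sample(decisions, per_stratum, seed=0):
    """Draw up to `per_stratum` paper IDs from each decision label.

    `decisions` maps a paper ID to "include", "exclude" or "maybe".
    A fixed seed keeps the audit sample reproducible.
    """
    by_label = defaultdict(list)
    for paper_id, label in decisions.items():
        by_label[label].append(paper_id)

    rng = random.Random(seed)
    sample = {}
    for label in ("include", "exclude", "maybe"):
        pool = sorted(by_label[label])      # sort so the draw is order-independent
        sample[label] = rng.sample(pool, min(per_stratum, len(pool)))
    return sample


def screen_positive(label):
    """Treat "maybe" as screen-positive, as the text describes."""
    return label in ("include", "maybe")


# Invented decisions for 100 hypothetical papers.
decisions = {
    f"P{i:03d}": "include" if i % 5 == 0 else "maybe" if i % 7 == 0 else "exclude"
    for i in range(100)
}
for label, ids in audit_sample(decisions, per_stratum=10).items():
    print(label, len(ids))
```

Sampling every stratum, not only the uncertain cases, is what exposes a criterion that is silently excluding relevant papers.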

Step 3 is full-text screening. The system retrieves available PDFs, applies criteria to the complete article and records an exclusion reason with supporting text. PRISMA-oriented teams can use dual review, with Elicit supporting two humans or acting as a second reviewer. The aim is not to eliminate disagreement. It is to make disagreement visible and resolvable.

Step 4 is data extraction. Create columns for study design, sample size, population, intervention, comparator, outcomes, follow-up, effect estimates, limitations and any review-specific variables. Elicit extracts from prose, tables and, on eligible plans, figures. Every value should be checked against its highlighted source, especially units, denominators, subgroup labels and time points.

Step 5 is synthesis and export. Reports can synthesise up to the plan’s source limit, while systematic-review exports can include screening files, extraction tables, bibliographies, a PRISMA flow diagram and a Word or PDF report. Keep a local archival copy because API-generated export links are presigned and expire after seven days.

A reproducible operating sequence

Protocol first, multiple searches second, calibrated screening third, verified extraction fourth, synthesis last. Reversing that order makes an attractive report easier to produce but harder to defend.

What the 2026 Accuracy Metrics Actually Mean

Elicit’s May 2026 evaluation is unusually detailed for a commercial research tool. The team sampled 1,000 open-access Cochrane reviews across 12 MeSH areas, removed duplicates and retained 994 unique reviews covering 38,493 study records. It then created stage-specific datasets because search, abstract screening, full-text screening and extraction require different ground truths. This is important: the headline percentages do not come from one universal test set.

Search recall was 95.0% for included studies when the review title alone was used as the semantic query. That is a conservative prompt, but the scoring subset depended on confidently resolved identifiers. Only 56.7% of listed studies could be matched to one unambiguous paper with a DOI. The result therefore says Elicit found most known included studies within the resolvable subset; it does not prove exhaustive coverage of every database, grey-literature source or non-indexed paper.
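
To make the resolvable-subset caveat concrete, here is a small sketch of recall scored only over included studies that map to a DOI. It is not Elicit's evaluation code; the function name, study labels and DOIs are invented for illustration.

```python
def recall_on_resolvable(included, resolved_doi, retrieved):
    """Recall computed only over included studies that resolved to a DOI.

    included:     study labels listed in the reference review
    resolved_doi: mapping from study label to DOI (absent = unresolved)
    retrieved:    set of DOIs returned by the search
    Returns (recall, share_of_included_studies_that_were_scored).
    """
    scorable = [resolved_doi[s] for s in included if s in resolved_doi]
    if not scorable:
        raise ValueError("no included study could be resolved to a DOI")
    found = sum(1 for doi in scorable if doi in retrieved)
    return found / len(scorable), len(scorable) / len(included)


# Invented example: four included studies, only two resolve to a DOI.
included = ["A 2010", "B 2012", "C 2015", "D 2018"]
resolved = {"A 2010": "10.1000/a", "B 2012": "10.1000/b"}
retrieved = {"10.1000/a", "10.1000/b", "10.1000/z"}
print(recall_on_resolvable(included, resolved, retrieved))   # (1.0, 0.5)
```

In this toy case recall is perfect on the scored subset while half of the included studies were never scored, which is why the second number deserves to be reported next to the first.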

Abstract screening achieved 96.89% sensitivity, 92.54% specificity, 70.09% precision and 93.21% accuracy across 931 positives and 5,162 negatives. Sensitivity is the key early-screening metric because a false negative can remove a relevant study permanently. Precision is lower because the system deliberately sends uncertain papers forward for human checking.
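
The relationship between these four figures is easier to see from a confusion matrix. The sketch below back-calculates the four cells by rounding each reported rate against the reported counts; the cell values are my assumption, not numbers published by Elicit.

```python
POSITIVES, NEGATIVES = 931, 5162          # counts reported in the evaluation

# Back-calculated by rounding rate x count; these four cells are an
# assumption, not figures published by Elicit.
tp = round(0.9689 * POSITIVES)            # relevant papers kept
fn = POSITIVES - tp                       # relevant papers wrongly excluded
tn = round(0.9254 * NEGATIVES)            # irrelevant papers removed
fp = NEGATIVES - tn                       # irrelevant papers sent forward

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (POSITIVES + NEGATIVES)

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("accuracy", accuracy)]:
    print(f"{name}: {value:.2%}")
```

Under these assumed cells the script reproduces all four reported percentages, with 29 relevant papers missed and 385 irrelevant papers sent forward. Precision is the lowest figure largely because negatives outnumber positives by more than five to one, so even a modest false-positive rate produces many extra papers.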

Full-text screening reached 99.5% paper-level sensitivity and 70.1% specificity. Per-criterion accuracy was 94.8%. The high sensitivity is valuable, but the lower specificity means reviewers should expect extra papers to survive into manual adjudication. The final full-text evaluation included 377 papers across 74 reviews, a much smaller base than the top-line corpus.

Extraction scored 95.6% correct for Methods, Participants and Interventions. The original pool narrowed sharply because the evaluator needed open-access PDFs and usable full text: 3,241 DOI-resolved studies became 198 accessible studies across 98 reviews, producing 769 extraction answers. Questions were reconstructed from Cochrane tables rather than taken from original extraction forms, and a separate language model graded answers before a limited manual audit.

Those caveats do not invalidate the results. They make them interpretable. The safest summary is that Elicit shows strong recall and extraction performance on medical systematic-review tasks, with public evidence of known error modes. It is not a 96% guarantee for every field, document type or custom column.

Stage	Headline metric	Supporting metrics	Dataset and practical meaning
Search	95.0% recall	79.8% of reviews had 100% of included studies found	Semantic search using review titles; identifier-resolvable Cochrane subset.
Abstract screening	96.89% sensitivity	92.54% specificity; 70.09% precision	Designed to minimise missed studies, so more false positives move forward.
Full-text screening	99.5% paper-level sensitivity	70.1% specificity; 94.8% criterion accuracy	Excellent retention of relevant papers, but substantial human adjudication remains.
Data extraction	95.6% correct	Methods, Participants and Interventions	Open-access subset of 198 studies; not a universal score for all extraction fields.

Features, Technical Specifications and Data Coverage

Elicit now spans discovery, analysis, monitoring and programmatic access. Search supports natural-language semantic retrieval, Lucene keyword mode, the full Elicit corpus, PubMed restriction and clinical-trial search. API filters cover dates, journal quartile, include or exclude keywords, PDF availability, retractions and study tags including reviews, meta-analyses, systematic reviews, RCTs and longitudinal studies.

Paper analysis includes summaries, eligible full-text chat, sentence-level citations, source viewing and custom table columns. Reports automate search, screening, extraction and synthesis. Research Agent supports iterative exploration, Alerts monitor new work, and Library stores sources for reuse.

Systematic Review adds eligibility criteria, title and abstract screening, full-text screening, include, exclude or maybe decisions, dual review, exclusion reasons, supporting quotations, full-text retrieval, structured extraction and auditable exports. Enterprise reaches 40,000 papers and 40 columns. Report synthesis is capped separately at 135 sources on Pro and 200 on Scale.

Custom extraction reads prose and tables. Scale and Enterprise explicitly add figure interpretation. Reviewers should verify visual values because transformed axes, censored observations, overlapping series and poor image quality can produce confident but wrong answers.

Eligible plans export RIS, CSV, BIB, PDF and DOCX. The API adds CSV and XLSX stage exports plus report outputs in BibTeX, DOCX, PDF, RIS and text. It uses bearer authentication and asynchronous polling for reports and reviews.

Enterprise adds SSO or SAML, two-factor authentication, domain verification, usage analytics, custom deployments, custom data sources and a stated default of not training on customer data.

Capability	Availability or scale	Important constraint
Academic paper search	138M+ papers; semantic and keyword modes	No academic index is complete; database-specific searching remains necessary.
Clinical-trial search	500K+ records	Record coverage does not guarantee complete result reporting or publication linkage.
Interactive tables	2, 5, 20, 30 or 40 columns depending on plan	The number of simultaneous columns and the extraction scale both vary by plan.
Systematic screening	5,000 papers on Pro; up to 40,000 Enterprise	High recall still requires conflict resolution and audit sampling.
Reports	2 or 4 monthly on Basic and Plus; 144 or 240 yearly on Pro and Scale	Report source caps are 135 on Pro and 200 on Scale.
Figures	Explicitly listed on Scale and Enterprise	Visual extraction is sensitive to low resolution and complex charts.
Library and alerts	Library in the current product; 10 alerts on Pro; unlimited on Enterprise	Not a replacement for full citation-management governance.

Elicit Pricing in 2026: Plans, Caps and Ambiguities

Older credit-based price summaries are no longer current. Elicit now sells Basic, Plus, Pro, Scale and Enterprise through capability and workflow quotas. Because the official page changes values with billing and audience selectors, the final checkout state is more authoritative than cached snippets.

Basic includes unlimited search across more than 138 million papers, unlimited summaries, eligible full-text chat, source viewing, Zotero import, two automated reports monthly and two table columns. Plus adds exports, four reports monthly, five columns and clinical-trial search. The static annual view lists Plus at $7 per user per month, billed as $84 annually.

Pro is the first systematic-review plan. It is listed at $29 per user monthly when billed annually, or $49 month to month. Limits include 5,000-paper screening, 144 reports or reviews yearly, 20 columns, 135 report sources, 10 alerts, custom extraction, templates and API access.

Scale is listed at $49 per user monthly when billed annually and $169 month to month. It adds full Research Agent access, figure interpretation, live collaboration, 240 annual workflows, 200 report sources, 30 columns and administration. Teams should capture the checkout quote before procurement.
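
The gap between annual and month-to-month billing is worth working through before procurement. The sketch below uses only the listed Pro and Scale prices quoted in this section; it ignores taxes, discounts and any selector-dependent changes, and the function names are mine.

```python
# Listed per-user prices from the text, in US dollars per month.
PLANS = {
    "Pro": {"annual_billing": 29, "month_to_month": 49},
    "Scale": {"annual_billing": 49, "month_to_month": 169},
}


def annual_commitment(plan):
    """Total paid for one user for a year when billed annually."""
    return PLANS[plan]["annual_billing"] * 12


def break_even_months(plan):
    """Smallest number of month-to-month payments that costs at least as
    much as a full annual commitment."""
    monthly = PLANS[plan]["month_to_month"]
    return -(-annual_commitment(plan) // monthly)   # ceiling division


for plan in PLANS:
    print(plan, annual_commitment(plan), break_even_months(plan))
# Pro 348 8
# Scale 588 4
```

On these listed figures, a Pro user who expects eight or more active months in a year pays less on annual billing, and for Scale the threshold drops to four months.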

Enterprise is custom priced, with 40,000-paper screening, 40 columns, unlimited alerts and API access, custom sources and templates, security, analytics and deployment options.

One entitlement conflict remains. The web workflow is described for Pro, Scale and Enterprise, but the API reference lists systematic-review API access as Enterprise-only. Automated-review buyers should confirm endpoint access separately.

Plan	Official price presentation	Key limits and inclusions	Best fit
Basic	Free	2 reports/month; 2 columns; unlimited search, summaries and eligible full-text chat; Zotero import	Occasional discovery and paper comparison.
Plus	$7/user/month billed annually; monthly figure not clearly exposed in static output	4 reports/month; 5 columns; exports; clinical-trial search	Individual researchers needing exports but not formal SLR scale.
Pro	$29/user/month billed annually or $49 monthly	5,000-paper screening; 144 workflows/year; 20 columns; 135 report sources; 10 alerts; API	Active systematic reviewers and PhD researchers.
Scale	$49/user/month billed annually or $169 monthly	240 workflows/year; 30 columns; 200 report sources; figures; collaboration; admin	Labs and research teams.
Enterprise	Custom	40,000-paper screening; 40 columns; unlimited alerts and API; SSO/SAML; custom sources	Regulated and high-volume organisations.

API, MCP and Integration Workflows

Elicit’s 2026 API materially changes the product from a browser tool into research infrastructure. The REST API exposes paper search, clinical-trial search, reports and systematic-review operations, while the Model Context Protocol server makes the same functionality available to compatible clients such as Claude Desktop and Claude Code. A bearer token authenticates REST requests; MCP uses OAuth 2.0.

The search endpoint is synchronous. It returns structured JSON with titles, authors, abstracts, citation counts, DOI, PMID, venue, year and source URLs. Pro can request up to 100 results per call and make 100 search requests in a rolling day. Scale raises both figures to 200. Enterprise can request up to 10,000 results and has no daily limit. Search and clinical-trial calls share the same rate-limit bucket, and response headers expose the current limit, remaining quota and reset timestamp.
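
A client can use those response headers to pace its requests. The sketch below is a guess at the shape of that logic: the header names are hypothetical placeholders, and treating the reset value as a Unix timestamp is an assumption that should be checked against the API reference.

```python
import time

# Hypothetical header names; check the API reference for the real ones.
H_REMAINING = "X-RateLimit-Remaining"
H_RESET = "X-RateLimit-Reset"     # assumed here to be a Unix timestamp


def seconds_to_wait(headers, now=None):
    """Return 0.0 if another request is allowed, else seconds until reset."""
    now = time.time() if now is None else now
    remaining = int(headers.get(H_REMAINING, 1))   # missing header: assume allowed
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get(H_RESET, now))
    return max(0.0, reset_at - now)


print(seconds_to_wait({H_REMAINING: "0", H_RESET: "1060"}, now=1000))   # 60.0
```

Because search and clinical-trial calls share one bucket, a single pacing function like this should sit in front of both.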

Reports are asynchronous. A client submits a research question with maximum search and extraction paper counts, receives a report identifier and polls until completion. Official guidance says a report typically takes five to fifteen minutes. This pattern suits scheduled evidence briefs, Slack bots and internal dashboards, but it requires retry logic, status handling and storage outside the expiring download links.
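
The submit-then-poll pattern can be sketched without committing to any endpoint. In this minimal example the caller supplies `fetch_status`, a function that would wrap the real status request; the status strings, the 30-second interval and the 30-minute timeout are assumptions of mine, not documented values.

```python
import time


def poll_until_done(fetch_status, interval=30.0, timeout=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `fetch_status()` until it reports a terminal state.

    `fetch_status` is a caller-supplied function returning a status string.
    The status names below are placeholders, not Elicit's documented values.
    """
    terminal_ok = {"completed"}
    terminal_bad = {"failed", "cancelled"}
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal_ok:
            return status
        if status in terminal_bad:
            raise RuntimeError(f"report ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError("report did not finish before the timeout")
        sleep(interval)


# Simulated run: a fake status source and a no-op sleep, so nothing waits.
states = iter(["queued", "running", "completed"])
print(poll_until_done(lambda: next(states), sleep=lambda s: None))   # completed
```

Injecting `sleep` and `clock` keeps the loop testable, which matters when the real wait is five to fifteen minutes per report.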

The documentation also exposes stage-level systematic-review data. A client can monitor gathering, abstract screening, full-text screening, extraction, report generation and completion. CSV and XLSX exports are available for stages, and report outputs include BibTeX, DOCX, PDF, RIS and text. Presigned file URLs expire after seven days, so a production integration must download and archive them promptly.
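
The seven-day expiry translates into a simple archiving deadline. This sketch assumes the lifetime is counted from the moment the link is issued, which the text does not state, and the one-day safety margin is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone

LINK_LIFETIME = timedelta(days=7)      # expiry stated in the text
SAFETY_MARGIN = timedelta(days=1)      # arbitrary buffer chosen for this sketch


def archive_by(issued_at):
    """Latest time at which the file should already be archived locally."""
    return issued_at + LINK_LIFETIME - SAFETY_MARGIN


def is_overdue(issued_at, now=None):
    """True once the archiving deadline has passed."""
    now = datetime.now(timezone.utc) if now is None else now
    return now > archive_by(issued_at)


issued = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
print(archive_by(issued).isoformat())   # 2026-05-07T09:00:00+00:00
```

A scheduled job that flags overdue links is cheaper than discovering a dead URL during an audit.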

Elicit suggests integrations with Claude, ChatGPT, Python scripts, Slack, internal databases and downstream analysis tools. The MCP route is particularly useful for researchers who want a conversational interface but need Elicit’s structured retrieval behind it. Our analysis of Claude AI for research explains why this division works: Claude can critique and draft, while Elicit supplies a more controlled evidence pipeline.

Zotero import is built into the product, but public documentation does not establish full two-way Zotero synchronisation. Treat import as an ingestion step unless Elicit confirms bidirectional updates for your account. For long-term libraries, keep Zotero or another reference manager as the system of record.

Minimal production checklist

Store API keys in a secret manager, enforce quotas, log queries and protocol versions, archive outputs before links expire, retain the Elicit session URL for human review, and test failures on missing PDFs and rate-limit responses.

Known Constraints and Performance Bottlenecks

The first bottleneck is corpus coverage. Elicit’s 138 million-paper index is large, but a systematic review can still require Embase, Web of Science, Scopus, discipline-specific databases, registries, conference proceedings, theses and grey literature. The right question is not whether Elicit has many papers; it is whether it covers every source named in the protocol. Our Perplexity versus Google Scholar comparison makes the same distinction between convenient synthesis and comprehensive discovery.

The second bottleneck is full-text access. Elicit can only analyse what it can retrieve or what a user uploads. Paywalls, damaged PDFs, scanned documents, multi-column parsing failures and supplementary files can break the pipeline. The 2026 extraction benchmark illustrates the effect: thousands of DOI-resolved studies narrowed to 198 studies with accessible, parseable PDFs.

The third is criterion quality. Screening models cannot rescue eligibility rules that are internally inconsistent, under-specified or changed after reviewers see the evidence. Teams should pilot criteria on a mixed sample and freeze a versioned protocol before bulk screening. When rules change, rerun affected papers and document the amendment.
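
One way to make a frozen, versioned protocol checkable is to fingerprint the criteria and store that fingerprint with every decision. The sketch below is a generic illustration, not an Elicit feature; the function names, criteria fields and paper IDs are hypothetical.

```python
import hashlib
import json


def criteria_version(criteria):
    """Short, stable fingerprint of an eligibility-criteria dictionary."""
    canonical = json.dumps(criteria, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def papers_to_rerun(screened_under, current_version):
    """IDs of papers whose last decision used a different criteria version."""
    return sorted(pid for pid, version in screened_under.items()
                  if version != current_version)


# Invented amendment: quasi-randomised designs are added to the protocol.
v1 = criteria_version({"population": "adults", "design": ["RCT"]})
v2 = criteria_version({"population": "adults", "design": ["RCT", "quasi-RCT"]})
print(papers_to_rerun({"P001": v1, "P002": v2}, v2))   # ['P001']
```

Sorting the keys before hashing means that reordering fields does not create a false amendment, while any change to the rules themselves does.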

The fourth is false-positive workload. Full-text sensitivity of 99.5% sounds almost complete, but specificity of 70.1% means many irrelevant papers may still reach adjudication. This is a reasonable trade for safety, yet it shifts labour rather than removing it. Review managers should measure hours saved after conflict resolution, not only the number of automated decisions.
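
A quick expected-value calculation shows how much adjudication work that specificity implies. The sensitivity and specificity come from the benchmark quoted above; the mix of 100 relevant and 900 irrelevant full texts is invented, and real prevalence will change the result.

```python
def expected_forwarded(relevant, irrelevant, sensitivity, specificity):
    """Expected papers passed to human adjudication, split by true status."""
    true_pos = relevant * sensitivity
    false_pos = irrelevant * (1 - specificity)
    return true_pos, false_pos


# 100 relevant and 900 irrelevant full texts is an invented case mix.
tp, fp = expected_forwarded(100, 900, sensitivity=0.995, specificity=0.701)
print(f"relevant forwarded: {tp:.1f}, irrelevant forwarded: {fp:.1f}")
print(f"share of forwarded papers that are irrelevant: {fp / (tp + fp):.0%}")
# relevant forwarded: 99.5, irrelevant forwarded: 269.1
# share of forwarded papers that are irrelevant: 73%
```

In this hypothetical mix, reviewers would still read roughly 369 papers to confirm about 100, so the saving is the 631 papers removed, not the headline sensitivity.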

The fifth is extraction brittleness. Custom columns that combine several facts, require arithmetic, depend on supplementary appendices or demand causal interpretation are more error-prone than direct fields such as sample size or intervention duration. Split compound questions into atomic columns. Preserve units. Add an explicit ‘not reported’ state so absence is not confused with zero.
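
The advice on atomic columns, preserved units and an explicit 'not reported' state can be expressed as a small validation schema. This is a hypothetical structure for a team's own verification layer, not Elicit's data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CellState(Enum):
    REPORTED = "reported"
    NOT_REPORTED = "not reported"
    UNCLEAR = "unclear"


@dataclass(frozen=True)
class ExtractedValue:
    """One atomic extraction cell (hypothetical schema for illustration)."""
    column: str
    state: CellState
    value: Optional[float] = None
    unit: Optional[str] = None
    source_quote: Optional[str] = None

    def __post_init__(self):
        if self.state is CellState.REPORTED:
            if self.value is None or self.unit is None:
                raise ValueError("a reported value needs both value and unit")
        elif self.value is not None:
            raise ValueError("only reported cells may carry a value")


zero_dropouts = ExtractedValue("dropouts", CellState.REPORTED, 0, "participants")
missing = ExtractedValue("dropouts", CellState.NOT_REPORTED)
print(zero_dropouts.value, missing.value)   # 0 None
```

Keeping the state separate from the value is what stops a true zero and a missing figure from collapsing into the same cell.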

Finally, synthesis remains limited. Elicit can produce reports, but it does not automatically deliver a defensible meta-analysis, certainty-of-evidence judgement, risk-of-bias assessment or nuanced cross-paper explanation of heterogeneity. A fluent narrative can still underweight contradictory findings. Human reviewers must inspect direction, magnitude, study quality and applicability before drawing conclusions.

Library, Writing and Citation Management Gaps

Elicit’s Library resolves the old criticism that work disappears after a search. Researchers can retain and organise sources, reuse them in later projects and attach alerts to continuing topics. The platform can also export bibliographic files and reports. That is enough for project continuity, but it is not equivalent to a mature reference-management environment.

Zotero remains stronger for canonical metadata, duplicate management, collections, group libraries, annotations, citation-style language, Word and Google Docs plugins, and the final production of citations and bibliographies. A robust workflow uses Zotero as the durable source library and Elicit as a computational review layer. Import records into Elicit, perform screening and extraction, then return verified references and notes to the writing environment.

Writing is the other boundary. Elicit Reports can generate a structured evidence brief, and the API can export a DOCX, but the product is not primarily a manuscript co-author. It does not replace the work of framing an argument, integrating theory, explaining methods, reporting deviations, interpreting bias or matching a journal’s rhetorical conventions. Our AI summariser tool guide draws a useful line between extracting a source’s content and constructing an original scholarly claim.

This is where general assistants and specialist writing platforms can complement Elicit. Claude can help critique logic and organise a draft. ThesisAI and PapersFlow advertise integrated writing and citation workflows. PapersFlow also claims a 474 million-item index, library management, LaTeX writing and counter-evidence discovery. ThesisAI advertises long-form scientific documents with LaTeX, Overleaf and Zotero integrations. However, I could not verify a complete current commercial pricing matrix for either product from an accessible official pricing page, so no exact price comparison is presented here.

The governance rule is simple: the system that stores your authoritative bibliography should not be the same place where temporary AI output is accepted without checking. Keep references, PDFs and final annotations under researcher control, and treat generated prose as a draft that must be traced back to original evidence.

Elicit vs Consensus, PapersFlow, ThesisAI and Zotero

Elicit and Consensus overlap in academic search, but their priorities differ. Consensus is optimised for fast evidence-grounded answers, snapshots and deep reviews. Elicit is stronger for procedural work: eligibility criteria, large-set screening, custom extraction and an audit trail. Consensus officially lists Free, Pro at $15 monthly or $120 yearly, and Deep at $65 monthly or $540 yearly.

PapersFlow presents an all-in-one workspace with search, multi-agent research, citation checking, counter-evidence, a library, workflow automation, LaTeX and presentation tools. That breadth addresses Elicit’s writing gaps, but its public claims need independent testing and a transparent price quote before it replaces a validated review process.

ThesisAI focuses on long-form academic generation. It may accelerate drafting, but a generated document is not a reproducible review. Researchers still need traceable citations, documented exclusions, treatment of contradictory studies and rerunnable methods. Elicit is the safer evidence layer; writing assistance can follow.

Zotero is Elicit’s natural partner, not a direct rival. It collects, organises, annotates, cites and shares research. Elicit imports from Zotero, but Zotero remains the stronger durable repository and citation engine.

Perplexity remains useful for current policy, company announcements and web sources beyond scholarly indexes. Our guide to Perplexity AI alternatives shows why a stack often beats a winner-takes-all choice: Elicit for systematic evidence, Zotero for references, Consensus for quick orientation, Perplexity for the live web and Claude for drafting or critique.

Choose by deliverable: Elicit for auditable review tables, Consensus for fast evidence answers, PapersFlow for an integrated workspace subject to validation, ThesisAI for assisted drafting and Zotero for bibliographic control.

Tool	Primary strength	Main limitation in this workflow	Verified 2026 pricing note
Elicit	Systematic screening, extraction and traceability	Not a full citation manager or complete meta-analysis engine	Free; Plus, Pro, Scale and Enterprise with selector-dependent rates.
Consensus	Fast synthesis of peer-reviewed evidence	Less procedural control for custom large-scale screening and extraction	$0; Pro $15 monthly or $120/year; Deep $65 monthly or $540/year.
PapersFlow	Integrated search, library, LaTeX writing and counter-evidence	Independent validation and complete official pricing were not verified	Official site offers a 7-day trial; exact matrix not verified.
ThesisAI	Long-form academic drafting and integrations	Generation is not a substitute for reproducible evidence synthesis	Exact current official pricing not verified.
Zotero	Reference management, annotation and citation production	No native systematic-review automation	Core software is free; storage plans are separate.

Can Elicit Analyse Non-Scientific Documents?

Elicit is designed for scientific and academic evidence, not general document intelligence. Users can upload papers for custom extraction and chat, and Enterprise can integrate custom data sources, but the product’s public validation, filters and metadata model are oriented towards research papers, study designs and clinical trials. That makes it a poor default for contracts, court judgments, financial filings, interview transcripts or large internal knowledge bases.

A non-scientific use case can still work when the documents resemble research articles and the questions are extractive. For example, a policy team might upload evaluation reports and ask for programme population, intervention, geography, method and outcome. The safer approach is to create atomic fields, test a representative sample and calculate error rates against human coding before scaling.

Problems arise when meaning depends on document hierarchy, annexes, legal definitions, cross-references, handwritten material or implicit institutional context. Elicit may return a plausible field while missing a qualification several pages away. It also lacks the document-level permissions, retention controls and connectors expected of a general enterprise knowledge platform unless those are negotiated through Enterprise.

For broad web and policy research, a cited answer engine may be more flexible. For deep qualitative synthesis, dedicated computer-assisted qualitative data-analysis software or a controlled retrieval system may be better. For contracts and regulated records, use a domain-specific platform with suitable security and evaluation evidence.

The decision rule is not ‘Can Elicit read this PDF?’ It is ‘Does Elicit’s workflow, data model and validation evidence match the judgement I am asking it to make?’ A successful demo is not proof of reliability. For consequential use, define a labelled test set, measure field-level precision and recall, record failure classes, and require human approval for every conclusion.
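
Field-level precision and recall against a labelled test set can be computed in a few lines. The sketch below assumes exact-match scoring and uses `None` for 'not reported'; both are simplifications, and the report IDs and values are invented.

```python
def field_scores(gold, predicted):
    """Per-field precision and recall against human coding.

    gold, predicted: {doc_id: {field: value or None}}; None means
    "not reported". Exact-match comparison is a simplifying assumption.
    """
    counts = {}
    for doc_id, gold_fields in gold.items():
        for field, truth in gold_fields.items():
            guess = predicted.get(doc_id, {}).get(field)
            c = counts.setdefault(
                field, {"correct": 0, "answered": 0, "present": 0})
            c["answered"] += guess is not None     # tool gave an answer
            c["present"] += truth is not None      # humans found a value
            c["correct"] += guess is not None and guess == truth
    return {
        field: {
            "precision": c["correct"] / c["answered"] if c["answered"] else None,
            "recall": c["correct"] / c["present"] if c["present"] else None,
        }
        for field, c in counts.items()
    }


# Invented human coding and tool output for two evaluation reports.
gold = {"R1": {"geography": "Kenya", "method": "RCT"},
        "R2": {"geography": "Peru", "method": None}}
predicted = {"R1": {"geography": "Kenya", "method": "RCT"},
             "R2": {"geography": "Chile", "method": "survey"}}
print(field_scores(gold, predicted))
```

Here the method field has perfect recall but only 50% precision, because the tool supplied an answer where the human coders found none; that is exactly the failure class a demo tends to hide.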

Auditability, Responsible Use and Expert Views

Elicit’s strongest strategic advantage is not that it automates every stage. It is that many decisions remain inspectable. Search strategies, eligibility criteria, exclusion reasons, supporting quotations, extraction sources and exports can be retained. This aligns with the 2025 joint position statement from Cochrane, the Campbell Collaboration, JBI and the Collaboration for Environmental Evidence, which says AI may be used only when methodological rigour and integrity are not compromised and human oversight is maintained.

The expert language around 2026 evidence synthesis is notably cautious. Artur Nowak, co-founder of Evidence Prime, said that “AI agents are rapidly reshaping workflows” when Cochrane selected tools for a platform study. Kevin Kallmes, CEO and co-founder of Nested Knowledge, emphasised “trusted, timely, evidence-driven decisions”. Neither statement amounts to endorsement of unsupervised review automation; both frame AI as something to evaluate within established methodological leadership.

Elicit’s own users make the same point from practice. Farhad Shokraneh, an SLR methodologist, highlighted a “traceable, explainable, and auditable Data Extraction module”. Heather Richbourg, a bioinformatics scientist at Ultragenyx, said the workflow “dramatically reduces the time to get a good overview”. The useful part of both testimonials is not speed alone. It is speed combined with a visible path back to evidence.

Andreas Stuhlmüller, Elicit’s co-founder and chief executive, described the company’s direction as breaking “hard-to-verify tasks” down into easier-to-check processes. That philosophy appears in decomposition, provenance, consistency checks, explicit stages and human-review URLs. It also explains why Elicit is more convincing as a supervised system than as an autonomous reviewer.

A 2026 peer-reviewed scoping review identified 388 AI tools and platforms across 137 studies, while warning about automation bias and lack of true semantic understanding. The field is therefore crowded but not settled. A responsible buyer should demand public validation, domain boundaries, transparent metrics, error analysis, data-governance terms and a way to reproduce decisions.

For publication, report the tool name, access date, plan or version, stages used, human-review process, conflict handling, search strategies, prompts or criteria, and any deviations. Do not write that AI ‘completed’ the review when humans designed, checked and approved it.
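The reporting items above can be kept as a structured record alongside the review files, so nothing is reconstructed from memory at submission time. This is a minimal sketch: the class name, field names and every example value are placeholders chosen for illustration, not a reporting standard and not an Elicit export format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AIUseRecord:
    """One record per AI-assisted review; fields mirror the list above."""
    tool_name: str
    access_date: str              # ISO date on which the tool was used
    plan_or_version: str
    stages_used: list
    human_review_process: str
    conflict_handling: str
    search_strategies: list
    prompts_or_criteria: list
    deviations: list = field(default_factory=list)

    def to_json(self):
        # ensure_ascii=False keeps accented author or place names readable
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values only; replace with the details of the actual review.
record = AIUseRecord(
    tool_name="Elicit",
    access_date="2026-01-01",
    plan_or_version="<plan or version as shown in the account>",
    stages_used=["search", "abstract screening"],
    human_review_process="<who checked which decisions>",
    conflict_handling="<how disagreements were resolved>",
    search_strategies=["<query 1>"],
    prompts_or_criteria=["<eligibility criterion 1>"],
)
print(record.to_json())
```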

What Elicit Still Needs to Improve

The first priority is broader and independently replicated validation. The Cochrane benchmark is substantial, but it is medical, open-access and partly constrained by identifier resolution and PDF availability. Public tests in social science, education, environmental evidence, engineering and qualitative research would show where performance transfers and where it falls.

The second is clearer product and pricing documentation. The pricing page switches between annual and monthly values with a selector, so the figures can look contradictory when the page is read as static output. Product announcements and API documentation also differ on which systematic-review capabilities are available through the API. A single machine-readable entitlement table would reduce procurement risk.

The third is deeper reference-management integration. Zotero import is useful, but researchers need reliable bidirectional synchronisation, duplicate handling, stable collections, annotation transfer and predictable re-import behaviour. Until those details are explicit, Zotero should remain the system of record.

The fourth is synthesis beyond extraction. Elicit can generate reports, yet formal reviews often require risk-of-bias tools, certainty frameworks such as GRADE, effect-size calculation, heterogeneity analysis, sensitivity analysis and transparent treatment of conflicting evidence. Integrations or structured exports into statistical and review software would close more of the end-to-end gap than another generic writing feature.

The fifth is better error telemetry. Teams need dashboards showing missing full texts, unreadable pages, low-confidence extractions, criteria with high disagreement, papers affected by a protocol amendment and fields with frequent ‘not reported’ values. Those signals would help reviewers allocate human attention where it has the highest marginal value.
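Until such dashboards exist, a team can approximate one of these signals from an exported extraction table. The sketch below assumes the export has been loaded as a list of row dictionaries and that missing values appear as the literal text "not reported"; the column names, the marker string and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import Counter


def not_reported_rates(rows, marker="not reported"):
    """Fraction of rows, per column, whose value equals the marker."""
    totals, hits = Counter(), Counter()
    for row in rows:
        for column, value in row.items():
            totals[column] += 1
            # Case-insensitive match, ignoring surrounding whitespace
            if isinstance(value, str) and value.strip().lower() == marker:
                hits[column] += 1
    return {column: hits[column] / totals[column] for column in totals}


def flag_columns(rates, threshold=0.5):
    """Columns whose 'not reported' rate meets or exceeds the threshold."""
    return sorted(c for c, rate in rates.items() if rate >= threshold)


# Hypothetical extraction rows with two columns.
rows = [
    {"dose": "10 mg", "follow_up": "Not reported"},
    {"dose": "not reported", "follow_up": "not reported"},
    {"dose": "5 mg", "follow_up": "12 weeks"},
    {"dose": "20 mg", "follow_up": "not reported"},
]
rates = not_reported_rates(rows)
flagged = flag_columns(rates)
```

A column that is flagged may reflect genuinely sparse reporting in the literature or an extraction prompt that needs rewording, so the signal points reviewers at a question, not an answer.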

Future updates are likely to deepen API automation, custom data integration, collaboration and auditable templates because Elicit targets high-stakes research. No public roadmap I could verify promises full Zotero synchronisation, broad non-scientific document support or autonomous meta-analysis; these capabilities remain open questions, not imminent promises. Independent validation and clearer change logs will matter as much as new features.

Takeaways

Treat Elicit’s 95%, 97%, 99% and 96% figures as stage-specific metrics, not one universal accuracy score.
Use Elicit when the expensive work is screening and structured extraction; use a writing assistant when the expensive work is drafting and argument development.
Keep Zotero or another reference manager as the authoritative library, even though Elicit now offers Library, alerts and Zotero import.
Pilot eligibility criteria and extraction columns on a mixed sample before processing thousands of papers.
Expect high-recall screening to retain some irrelevant papers for human adjudication, particularly at full-text screening, where reported specificity was 70.1%.
Archive API exports immediately because presigned file links expire after seven days.
Verify plan entitlements at checkout, especially annual versus monthly pricing and Enterprise-only systematic-review API access.
Document every AI-assisted judgement, human check and protocol change if the review will inform publication, policy, clinical or commercial decisions.
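The takeaway about high recall and irrelevant papers can be illustrated with simple arithmetic. The sketch below applies the reported full-text figures (99.5% sensitivity, 70.1% specificity) to a hypothetical set of 100 relevant and 900 irrelevant papers. The paper counts are invented for illustration, and carrying benchmark rates over to another corpus is itself an assumption, since the figures may not generalise.

```python
def screening_outcome(n_relevant, n_irrelevant, sensitivity, specificity):
    """Expected screening counts for given sensitivity and specificity."""
    true_pos = sensitivity * n_relevant           # relevant papers kept
    false_neg = n_relevant - true_pos             # relevant papers lost
    false_pos = (1 - specificity) * n_irrelevant  # irrelevant papers kept
    retained = true_pos + false_pos
    return {
        "relevant_retained": true_pos,
        "relevant_missed": false_neg,
        "irrelevant_retained": false_pos,
        "precision_of_retained": true_pos / retained if retained else None,
    }


# Hypothetical corpus: 100 relevant and 900 irrelevant full texts.
result = screening_outcome(
    n_relevant=100,
    n_irrelevant=900,
    sensitivity=0.995,   # reported full-text paper-level sensitivity
    specificity=0.701,   # reported full-text specificity
)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, roughly 269 irrelevant papers pass alongside about 99 relevant ones, so most of what reaches human adjudication is still irrelevant even though almost nothing relevant is lost.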

Conclusion

Elicit enters the second half of 2026 as one of the most credible AI products for systematic evidence work. It combines a large academic corpus, semantic and keyword search, PRISMA-oriented screening, source-linked extraction, reports, a library, alerts, exports and programmatic access. More importantly, it publishes enough evaluation detail to distinguish recall from specificity and to expose the limits of its own benchmarks.

That transparency supports a favourable verdict, but not an unconditional one. Elicit can save substantial time when a protocol is clear and the evidence resembles the domains it has been tested on. It cannot guarantee complete database coverage, repair an incoherent review question, judge bias without a method, or turn extracted fields into a defensible causal conclusion. Its lower full-text specificity, dependence on accessible PDFs and incomplete replacement of citation and writing tools remain meaningful constraints.

The best 2026 workflow is therefore modular. Use Elicit to discover, screen and extract; use domain databases and librarians to protect coverage; use Zotero to govern references; use statistical and risk-of-bias tools for formal synthesis; and use a writing assistant only after the evidence table is verified. Open questions remain around cross-domain validation, bidirectional library integration, independent benchmarking and the boundary between supervised acceleration and autonomous review.

Frequently Asked Questions

Is Elicit AI worth it in 2026?

Yes, for researchers who repeatedly screen and extract data from large paper sets. Basic is strong for discovery, while Pro adds the dedicated systematic-review workflow. It is less valuable for users whose main need is manuscript writing or citation management.

How accurate is Elicit AI?

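Elicit reported 95.0% search recall, 96.9% abstract-screening sensitivity, 99.5% full-text sensitivity and 95.6% correctness on selected extraction tasks. These metrics measure different stages on different subsets of the benchmark (the extraction figure, for example, relied on 198 studies with accessible PDFs), so they should not be combined into one universal accuracy percentage. Full-text specificity was lower, at 70.1%.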

Can Elicit conduct a complete systematic review?

It can support search, screening, extraction and report generation, with PRISMA-oriented outputs. Humans still need to design the protocol, resolve conflicts, assess bias, verify extraction, conduct statistical synthesis and approve conclusions.

Does Elicit integrate with Zotero?

Elicit’s current pricing page lists Zotero import. The public documentation reviewed for this article did not confirm full bidirectional synchronisation, so Zotero should remain the authoritative library unless your account documentation states otherwise.

Can Elicit write a literature review?

Elicit Reports can generate a structured research brief with citations, but the platform is not a complete academic writing environment. Researchers still need to develop the argument, explain methods, assess evidence quality and adapt the manuscript to publication requirements.

Can Elicit analyse non-scientific documents?

It can extract information from uploaded documents, especially when they resemble research reports. Its public validation and metadata are scientific, so contracts, legal records, transcripts and internal corpora require separate testing and may be better handled by domain-specific tools.

What is the best Elicit alternative?

Consensus is strong for rapid evidence answers, PapersFlow targets an integrated research-and-writing workspace, and specialist review platforms offer different governance controls. Zotero is the best complement for references rather than a direct substitute.

What plan is needed for the Elicit API?

The official API reference says Pro or higher is required for search and reports. It also states that systematic reviews through the API are Enterprise-only. Confirm current entitlements because web-workflow access and API access are not identical.

References

Cochrane. (2026, March 17). Cochrane announces selected AI tools for innovative platform study. https://www.cochrane.org/about-us/news/cochrane-announces-selected-ai-tools-innovative-platform-study

Consensus. (2026, April 30). Subscription plans. The Consensus Help Center. https://help.consensus.app/en/articles/10087865-subscription-plans

Elicit. (2026). Elicit API reference (OpenAPI version 3.1.0; API version 0.0.1). https://docs.elicit.com/

Elicit. (2026). Pricing. https://elicit.com/pricing

Elicit. (2026, May 6). Elicit systematic review: Now built for PRISMA 2020. https://elicit.com/blog/systematic-review-for-prisma-2020

Flemyng, E., Noel-Storr, A., Macura, B., Gartlehner, G., Thomas, J., Meerpohl, J. J., Jordan, Z., Minx, J., Eisele-Metzger, A., Hamel, C., Jemioło, P., Porritt, K., & Grainger, M. (2025). Position statement on artificial intelligence use in evidence synthesis. Environmental Evidence, 14, Article 20. https://doi.org/10.1186/s13750-025-00374-5

Prasad, P. (2026, May 6). Evaluating Elicit’s systematic literature review capabilities. Elicit. https://elicit.com/blog/evaluating-elicit-slr

Sousa, M. S. A., Peiris, S., Figueiró, M. F., Haby, M. M., Baraldi, A. C., Reveiz, L., & Souza, J. P. (2026). The landscape of artificial intelligence tools and platforms for evidence synthesis: A scoping review. Systematic Reviews, 15, Article 82. https://doi.org/10.1186/s13643-025-02842-y

Zotero. (2026). Your personal research assistant. https://www.zotero.org/
