5 AI Tools for Systematic Review Without Losing Rigour

Sami Ullah Khan

June 17, 2026


A systematic review can lose more rigour in one opaque exclusion decision than it gains from days of saved labour. I approached this comparison of AI tools for systematic review with that tension in mind: the useful question is not which platform promises the most automation, but which one makes its assistance inspectable, reversible, and reportable. This article explains how Rayyan, Scholara, Covidence, Elicit, and ASReview handle literature discovery, deduplication, title and abstract screening, full-text review, data extraction, risk-of-bias work, and meta-analysis. It also shows what each product costs, where plan caps appear, which integrations are documented, and how to preserve PRISMA compliance.

The five tools are not interchangeable. Rayyan is strongest as a collaborative screening environment with AI assistance layered over reviewer decisions. Scholara is the most ambitious single-workspace proposition, extending from PICO and search construction to extraction, RoB 2 assessment, and forest plots. Covidence remains the governance-heavy choice for teams that need a mature review workflow, clear role separation, and institution-level administration. Elicit is particularly effective for literature discovery, evidence tables, summaries, and structured extraction. ASReview is the open-source active-learning option for teams willing to manage Python, model choices, and validation themselves.

This is a documentation-led 2026 evaluation, not a controlled head-to-head product trial. I therefore treat vendor time savings, accuracy claims, installation counts, and user totals as claims rather than independent findings. The practical recommendation is consistent throughout: use AI to prioritise, propose, and structure; keep humans responsible for eligibility, critical extraction fields, risk-of-bias judgements, and synthesis. That distinction matters because screening assistance can accelerate a review without proving that no relevant study was missed.

AI Tools for Systematic Review: What Actually Changes

Traditional systematic reviewing divides labour across databases, reference managers, spreadsheets, screening platforms, statistical packages, and manuscript files. AI tools for systematic review compress that stack, but they do so in different ways. Some rank records by predicted relevance. Others generate eligibility rationales, pre-populate data fields, or transform extracted values into analysis tables. The productivity gain comes from reducing repeated navigation and manual transcription, not from transferring scientific accountability to a model.

The first change is ordering. Active-learning systems such as ASReview, and relevance sorting in Covidence, alter which records a reviewer sees first. This can bring likely inclusions to the top, shorten the time to evidence discovery, and make early protocol problems visible. It does not automatically reduce the number of records that must be screened unless the team adopts a documented stopping rule and validates the excluded tail. A ranking model can be valuable even when every record is eventually assessed.

The second change is decomposition. Elicit, Rayyan ResearchPilot, and Scholara break broad tasks into smaller operations: identify PICO elements, locate supporting text, propose an inclusion decision, extract a number, and attach provenance. This is safer than asking a general chatbot to “do the review” because each output can be checked against a record. Readers considering adjacent academic workflows can compare the wider landscape of AI research tools for 2026, but systematic review software should be judged against protocol fidelity, audit trails, and export quality rather than conversational fluency.

The third change is the shape of quality control. Duplicate removal, conflict resolution, blinded screening, versioned criteria, and extraction verification become software configuration questions. A robust workflow records when AI was enabled, which model or feature was used, whether rankings changed after labels accumulated, who resolved conflicts, and how potentially missed studies were sampled. The 2025 joint position statement from Cochrane, Campbell, JBI, and the Collaboration for Environmental Evidence permits AI use only with human oversight, methodological integrity, and transparent reporting. Those conditions should be treated as design requirements, not final-stage disclosures.

Comparison at a Glance

The table below reflects documented capabilities available in June 2026. “Best for” describes workflow fit, not a universal quality ranking. Organisations should also review data processing terms, retention policies, regional hosting, and procurement requirements before uploading unpublished manuscripts or sensitive review material.

| Tool | Best for | Documented AI features | Integrations and technical fit | Commercial model |
|---|---|---|---|---|
| Rayyan | Collaborative screening, deduplication, review coordination | AI relevance ratings; AI Analyzer; AI Reviewer; auto-extraction; PICO support; conflict and blind modes; sampling; PRISMA and audit features | RIS, BibTeX, EndNote and common citation imports/exports; enterprise API documented; web and mobile workflows | Freemium; paid individual, academic, business, and enterprise tiers |
| Scholara | Closest to an end-to-end AI workspace | PICO and MeSH search building; triple-agent consensus screening; rationales; full-text screening; structured extraction; RoB 2; forest plots; subgroup and sensitivity views; PRISMA logs | PubMed, ClinicalTrials.gov, Europe PMC; imports from Embase, Cochrane and common formats; CSV and broader exports by plan; SSO/SAML on enterprise | Free evaluation tier; Starter, Pro, Team, Enterprise |
| Covidence | Large teams, governance, duplicate screening, extraction | Relevance sorting; automatic deduplication; RCT tagging; extraction suggestions; conflict resolution; PRISMA flow; administrative oversight | Citation file imports and reference-manager interoperability; exports for analysis and reporting; no broad public automation API verified | Annual single review, three-review package, organisation pricing |
| Elicit | Discovery, summarisation, evidence tables, structured extraction | Semantic search; screening; research reports; custom columns; supporting quotes; extraction explanations; alerts; figure extraction on higher tiers | Zotero import; RIS, CSV, BIB, PDF and DOCX exports by plan; API on Pro and above; enterprise custom data and search API | Free Basic; Plus, Pro, Scale, Enterprise; multiple audience variants shown |
| ASReview | Open-source title and abstract prioritisation | Active learning; model switching; simulation; performance testing; parallel or team deployment; multilingual and heavier model options; duplicate hiding; editable tags | Python package; local browser app; server and Docker deployment; extension ecosystem; project data retained locally unless centrally hosted | Open-source software; infrastructure and administration are separate costs |

Sources: official vendor product, pricing, help, and installation pages accessed 16 June 2026. Feature availability varies by plan and institutional contract.

No single product wins every stage. A team may use Elicit for early discovery, Rayyan or Covidence for blinded screening, a structured extraction environment for data collection, and R or dedicated meta-analysis software for final synthesis. That mixed stack is not a failure of integration. It can be a control mechanism, because moving data between stages creates explicit validation checkpoints. For broader study workflows, the site’s AI tools for students comparison provides context, but a review team should prioritise reproducibility over the number of features in one subscription.

Rayyan: Best for Collaborative Screening

Where Rayyan is strongest

Rayyan remains the clearest choice when the main bottleneck is coordinated title and abstract screening. It supports blinded decisions, labels, exclusion reasons, conflict resolution, reviewer invitations, filtering, random samples, duplicate handling, and mobile work. Its current product page describes ResearchPilot functions that include zero-shot relevance ratings, AI Analyzer, AI Reviewer, and Auto-Extract Data. The company also documents deduplication for up to 200,000 references and says its platform has supported more than three million reviews across more than 190 countries. Those scale figures are vendor-reported, but they signal an established collaboration model rather than a newly assembled chatbot interface.

“democratizing access to an essential tool for science” Robert Ayan, CEO of Rayyan, on the company’s 2026 product page.

The operational advantage is that AI assistance sits inside a familiar review process. Teams can begin with manual blind screening, inspect conflicts, and then add prioritisation or AI rationales without redesigning the protocol. ResearchPilot’s AI Reviewer can act as an additional opinion, while AI Analyzer and relevance ratings can help triage large imports. The responsible pattern is to treat these outputs as recommendations. Reviewers should still apply pre-specified eligibility criteria, record exclusions, and adjudicate ambiguous abstracts.

Constraints, plan boundaries, and API access

The free plan is useful but capped at three active reviews and two free reviewers. Paid individual tiers add capacity and advanced AI functions, while institutional plans add larger reviewer allowances, management, SSO or SAML, and API access. ResearchPilot availability can depend on an institutional contract, so a team should not assume that every feature shown in help documentation is included in a personal subscription. The product’s enterprise API is documented, but public materials do not expose a complete endpoint-by-endpoint rate-limit matrix. Procurement teams should ask about authentication, export schemas, batch limits, data residency, retention, and whether API actions appear in the review audit log.

The most important edge case is duplicate leakage. If the same trial appears as a conference abstract, registry record, preprint, and journal article, automatic deduplication may either keep too much or collapse legitimately distinct reports. Resolve duplicates before creating validation samples, but preserve links between multiple reports of one study. Otherwise, an apparent recall estimate can be inflated because near-identical records appear in both the training and validation sets. Researchers using general-purpose assistants alongside Rayyan should also distinguish screening evidence from narrative synthesis; the practical differences are explored in the site’s Claude AI research workflow.

Scholara: Closest to an End-to-End Review Workspace

Full pipeline capabilities

Scholara is the broadest end-to-end proposition in this group. Its documented workflow begins with PICO extraction and MeSH-assisted search construction, searches sources including PubMed, ClinicalTrials.gov, and Europe PMC, then creates unified study records for screening, PDFs, extraction, risk of bias, and analysis. Three AI agents independently vote on inclusion, with disagreements flagged for human review. The system exposes rationales and confidence, and its reporting layer rolls search logs, screening decisions, and exclusions into a live PRISMA 2020 flow. On the analysis side, Scholara advertises interactive forest plots, subgroup analyses, sensitivity diagrams, RoB 2 assessment, and statistical meta-analysis.

That integrated design removes hand-offs, but it also concentrates risk. A protocol change can propagate through search, screening, extraction, and analysis. Teams should therefore freeze a protocol version before bulk screening, record amendments as amendments, and export snapshots after each stage. A unified study record is especially useful when one study has several publications, but reviewers must confirm whether extracted outcomes belong to the correct report, time point, population, and analysis set.

“achieving >90% in most evaluation metrics” Dr Joe Cutteridge, doctor and researcher, in a vendor-hosted Scholara testimonial that also cites a 180x time reduction.

The quotation is promising but not an independent benchmark. Scholara’s page does not publish the evaluation dataset, class balance, review domains, confidence intervals, or whether the metric refers to precision, recall, agreement, or extraction accuracy. A defensible buyer evaluation should replay a completed review, conceal the final inclusion set, and calculate study-level recall rather than accepting a single aggregate percentage.
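The replay described above comes down to one small calculation. The following is a minimal sketch, assuming the team already holds a mapping from record IDs to study IDs and the final inclusion set of a completed review. The function name and the sample IDs are hypothetical and do not come from any vendor export.

```python
def study_level_recall(final_included_studies, tool_included_records, record_to_study):
    """Share of finally included studies for which the tool kept at least one report."""
    included = set(final_included_studies)
    if not included:
        raise ValueError("the reference review has no included studies")
    # A study counts as found if any of its records survived the tool's screening.
    found = {record_to_study[r] for r in tool_included_records if r in record_to_study}
    return len(found & included) / len(included)


# Hypothetical replay: five records belonging to four studies.
record_to_study = {"r1": "S1", "r2": "S1", "r3": "S2", "r4": "S3", "r5": "S4"}
final_included = {"S1", "S2", "S3"}      # human decisions from the completed review
tool_included = {"r2", "r3", "r5"}       # what the tool kept when replayed

recall = study_level_recall(final_included, tool_included, record_to_study)
print(f"study-level recall: {recall:.2f}")  # S1 and S2 found, S3 missed
```

Counting at study level matters because a tool that keeps one of two reports for the same trial has not missed the study, whereas a record-level figure would penalise it.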

Pricing contradictions and security wording

Scholara’s plan cards and its FAQ currently display conflicting usage limits. The cards list 300 abstract screens per day and 2,500 per month on Free, 750 per day and 10,000 per month on Starter, 2,500 per day on Pro, and 60 full-text screens plus 60 extractions per seat per day on Team. The FAQ instead refers to roughly 100 abstracts and 10 PDFs per day on Free, 500 abstracts on Starter, 2,000 on Pro, and 50 full-text screens on Team. The correct operational cap should be confirmed in checkout or contract language before budgeting a large review.
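Until the cap is confirmed, it can help to budget under both sets of figures. The sketch below is a rough calendar estimate. It assumes a 30-day billing month and that monthly allowances reset at the start of each month. It also assumes the FAQ figure carries no monthly cap, since none is quoted above. `screening_days` is a hypothetical helper, not part of any Scholara tooling.

```python
def screening_days(n_records, daily_cap, monthly_cap=None, days_per_month=30):
    """Calendar days needed to screen n_records under a daily and optional monthly cap."""
    if daily_cap <= 0 or (monthly_cap is not None and monthly_cap <= 0):
        raise ValueError("caps must be positive")
    remaining, days, used_this_month = n_records, 0, 0
    while remaining > 0:
        if days % days_per_month == 0:
            used_this_month = 0                      # assumed monthly reset
        allowance = daily_cap
        if monthly_cap is not None:
            allowance = min(allowance, monthly_cap - used_this_month)
        screened = min(allowance, remaining)
        remaining -= screened
        used_this_month += screened
        days += 1
    return days


# A 20,000-record review on Starter, under each published figure.
plan_card_days = screening_days(20_000, daily_cap=750, monthly_cap=10_000)
faq_days = screening_days(20_000, daily_cap=500)
print(plan_card_days, faq_days)  # 44 40
```

Under these assumptions the monthly cap, not the daily one, sets the pace on the plan-card figures, which is why the lower FAQ daily limit still finishes sooner.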

Enterprise materials list SSO or SAML and SOC 2 and ISO 27001, while the separate security page has described SOC 2 Type II as becoming available after certification. That is not enough evidence to claim completed certification. Buyers should request the current audit report or certificate, subprocessor list, regional hosting details, deletion procedure, and incident-notification terms. For researchers designing a complete doctoral workflow, the site’s PhD research guide with AI offers adjacent planning context, but the review protocol must remain the controlling document.

Covidence: Team Governance and Responsible Automation

Why large teams still choose it

Covidence is less focused on presenting itself as an autonomous reviewer and more focused on managing a methodologically recognisable review. It supports citation import, automatic duplicate detection, title and abstract screening, full-text assessment, exclusion reasons, conflicts, extraction templates, risk-of-bias work, PRISMA flow reporting, and institutional administration. Unlimited collaborators on review packages are valuable when clinicians, librarians, statisticians, and subject experts work across organisations.

Its AI features include relevance-based sorting, automated RCT tagging, and extraction suggestions. Import handling has practical limits: Covidence allows an unlimited number of references overall, but its support documentation instructs users to split import files into 15,000 records or fewer, keep each file below 50 MB, and import one file at a time. Those are exactly the operational details that determine whether a tool works for an update with 80,000 records. Automatic deduplication occurs during import, so teams should archive original search exports and log the number imported from each source before deduplication.
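For an update of that size, the file-splitting step can be scripted. This is a minimal sketch, assuming the search exports are RIS files in which every record ends with an `ER  -` line. It enforces only the 15,000-record limit and does not check the 50 MB file size. The helper names are hypothetical.

```python
def split_ris_records(ris_text):
    """Split RIS text into records, each closed by its 'ER  -' terminator line."""
    records, current = [], []
    for line in ris_text.splitlines():
        if not current and not line.strip():
            continue                      # skip blank lines between records
        current.append(line)
        if line.startswith("ER  -"):
            records.append("\n".join(current) + "\n")
            current = []
    # Any trailing lines without a terminator are ignored rather than guessed at.
    return records


def chunk_records(records, max_per_file=15_000):
    """Group records into import batches no larger than max_per_file."""
    return [records[i:i + max_per_file] for i in range(0, len(records), max_per_file)]


sample = "TY  - JOUR\nTI  - A\nER  - \n\nTY  - JOUR\nTI  - B\nER  - \n"
print(len(split_ris_records(sample)))                     # 2

# An 80,000-record update needs six files: five full batches and one of 5,000.
batch_sizes = [len(batch) for batch in chunk_records(list(range(80_000)))]
print(batch_sizes)
```

Writing each batch to its own file and logging the record count per file gives the pre-deduplication totals that the PRISMA flow needs later.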

A useful 2026 lesson about average recall

Covidence published an unusually informative account of a model it chose not to release. An automated ineligibility feature achieved average recall of 98.6%, above the team’s 98% threshold, but performance across individual reviews ranged from 65.7% to 100%. Because a false exclusion could compromise conclusions and humans would only audit the output, the company withheld the feature. In another lower-risk extraction use case, a sponsorship-source suggestion model reached 92.2% precision and 100% recall on 107 representative studies and was released with reviewers required to accept or reject each suggestion.

“Knowing how a model performs is not the same as knowing whether it should be used.” Covidence product team, 2026 responsible automation series.

This distinction is central to choosing AI tools for systematic review. Mean accuracy is not enough. Review-level minima, confidence intervals, domain shift, human oversight, and the impact of error determine whether automation is appropriate. Covidence’s approach also clarifies why a model can be safe for sorting, where humans still decide, but unsafe for automatic exclusion, where a missed study may never be seen.
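A short calculation shows how a mean can clear a threshold while one review fails badly. The per-review values below are invented for illustration and are not Covidence data. Only the 98% threshold is taken from the account above.

```python
from statistics import mean


def recall_summary(per_review_recall, threshold=0.98):
    """Summarise recall across reviews, reporting the worst case alongside the mean."""
    average = mean(per_review_recall)
    return {
        "mean": average,
        "minimum": min(per_review_recall),
        "reviews_below_threshold": sum(r < threshold for r in per_review_recall),
        "mean_passes": average >= threshold,
    }


# Hypothetical evaluation: 19 reviews with perfect recall, one with 70%.
illustrative = [1.0] * 19 + [0.70]
summary = recall_summary(illustrative)
print(summary)
# The mean is about 0.985 and passes, yet one review would have lost 30% of its studies.
```

Reporting the minimum and the count of failing reviews next to the mean makes the release decision visible rather than leaving it inside one aggregate.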

The main commercial constraint is cost per review rather than per seat. A single review is $339 for 12 months, a package of up to three reviews is $907, and organisation-wide access is custom priced. No broad public automation API was verified in the current documentation. Integration is therefore primarily through reference files, exports, and established workflow boundaries. Teams comparing search tools should read the site’s Perplexity versus Google Scholar analysis, then keep discovery tools separate from the final reproducible database search.

Elicit: Discovery, Extraction, and Research Briefs

Where Elicit saves the most work

Elicit is most useful before and around formal screening: exploring terminology, finding seed papers, building evidence tables, summarising papers, asking structured questions, and extracting fields with supporting quotations. Its current pricing page advertises search across more than 138 million papers, unlimited search on the Basic tier, research reports, custom extraction columns, explanations, alerts, figure extraction on higher plans, and systematic-review projects that can handle thousands of papers. The current homepage states that more than five million people use the service, replacing the older two-million figure still repeated in some articles.

For a systematic review, the key feature is not the generated summary. It is provenance. A proposed value should link to the source passage, page, table, or figure so a reviewer can verify it. This is particularly important for denominators, adjusted versus unadjusted estimates, intention-to-treat populations, subgroup outcomes, and units. An AI extraction system can identify a plausible number while attaching it to the wrong arm or time point. A two-column evidence table should therefore include the extracted value and an independent verification status.

“AI has an extremely jagged capabilities profile.” Andreas Stuhlmüller, Elicit co-founder and chief executive, April 2026.

Stuhlmüller’s point explains why Elicit can be excellent at decomposed retrieval tasks and still require caution in final synthesis. Its product design increasingly reduces hard-to-verify outputs into smaller claims with provenance, consistency checks, and process supervision. That architecture is methodologically preferable to a single fluent answer. The site’s AI summariser tool guide explains the broader summarisation market, but systematic reviewers should avoid replacing extraction forms with unstructured summaries.

Plan variants, API, and export constraints

The researcher-facing annual pricing block currently shows Basic at $0, Plus at $7 per month billed annually, Pro at $29 per user per month billed annually, and Scale at $49 per user per month billed annually. Pro lists systematic reviews up to 5,000 papers, 144 reports per year, 20 columns, 135 report sources, ten alerts, custom extraction, explanations, and API access. Scale lists 240 reports per year, 200 sources per report, 30 columns, figure extraction, and collaboration. Enterprise adds up to 40,000 papers, 40 columns, SSO or SAML, two-factor authentication, domain verification, single tenancy, custom data, and an unlimited Search API.

However, the same public pricing page exposes other audience or industry blocks with higher Pro and Scale prices. A buyer should confirm which tab, billing term, and usage schedule applies to the account. Exports include RIS, CSV, BIB, PDF, and DOCX on applicable plans, and Zotero import is documented. API access begins on Pro in the visible research block, but complete rate limits and endpoint quotas are not published in the captured pricing material. The safest implementation is to test round-trip fidelity: export a small project, verify identifiers and multiline fields, and confirm that provenance links survive the chosen format.
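The round-trip test can be expressed as a small comparison harness. The sketch below uses only Python's standard `csv` module and an in-memory buffer, so by itself it shows the comparison logic rather than any behaviour of Elicit. In practice the reloaded rows would come from the file the tool actually exports. The field names and sample rows are hypothetical.

```python
import csv
import io


def csv_round_trip(rows, fieldnames):
    """Write rows to CSV text and read them back, as an export/import cycle would."""
    buffer = io.StringIO(newline="")          # no newline translation inside fields
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


def find_round_trip_losses(original_rows, reloaded_rows, fieldnames):
    """Return (row_index, field) pairs whose value changed, plus any row-count mismatch."""
    losses = []
    if len(original_rows) != len(reloaded_rows):
        losses.append(("row_count", f"{len(original_rows)} != {len(reloaded_rows)}"))
    for index, (before, after) in enumerate(zip(original_rows, reloaded_rows)):
        for field in fieldnames:
            if before[field] != after[field]:
                losses.append((index, field))
    return losses


fields = ["study_id", "doi", "extraction_note", "source_quote"]
rows = [
    {"study_id": "S1", "doi": "10.1000/example-1",
     "extraction_note": "Primary outcome at 12 weeks\nSecondary outcome at 24 weeks",
     "source_quote": 'Müller et al. report "no difference", n = 200'},
]
losses = find_round_trip_losses(rows, csv_round_trip(rows, fields), fields)
print(losses)  # an empty list means identifiers, multiline text, and quotes survived
```

Running the same comparison on a ten-record pilot project is usually enough to expose truncated notes or dropped identifiers before the review grows.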

ASReview: Open-Source Active Learning

Technical setup and model behaviour

ASReview is the strongest choice when a team wants transparent, open-source prioritisation and has the technical capacity to operate it. ASReview LAB 3.0.7 requires Python 3.10 or later, installs with pip, and launches a local browser interface. It can also run through server or container deployments. The project supports active learning, simulations, model performance testing, model switching, custom models, multilingual options, heavier models, editable tags, and duplicate hiding. Local operation can keep project data on the researcher’s machine, while central deployment requires the team to manage authentication, backups, updates, logs, and access controls.

The core loop is simple: reviewers label a small set of records, the model ranks the remaining records, reviewers screen the next batch, and the ranking updates. The person making labels is called the oracle in the documentation. That label stream is the model’s training signal, so uncertain decisions should be resolved before they are committed. If eligibility criteria change halfway through, the model has learned from an inconsistent target. The correct response is not merely to continue screening. The team should document the amendment, review prior labels affected by it, retrain or restart as appropriate, and revalidate performance.
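The loop can be illustrated with a toy example. The sketch below is not ASReview's code or API. It swaps in a deliberately simple word-count scorer with add-one smoothing so that the label, rank, screen, update cycle is visible in a few lines. The records, labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return set(text.lower().split())


def score(text, rel_counts, irr_counts, n_rel, n_irr):
    """Naive-Bayes-style log-odds that a record is relevant, with add-one smoothing."""
    total = 0.0
    for token in tokenize(text):
        p_rel = (rel_counts[token] + 1) / (n_rel + 2)
        p_irr = (irr_counts[token] + 1) / (n_irr + 2)
        total += math.log(p_rel / p_irr)
    return total


def active_learning_order(records, oracle, seed_ids):
    """Label seeds, then repeatedly retrain, rank, and ask the oracle about the top record."""
    labels = {rid: oracle(rid) for rid in seed_ids}
    order = list(seed_ids)
    while len(labels) < len(records):
        rel_counts, irr_counts = Counter(), Counter()
        for rid, is_relevant in labels.items():          # retrain on every label so far
            (rel_counts if is_relevant else irr_counts).update(tokenize(records[rid]))
        n_rel = sum(labels.values())
        n_irr = len(labels) - n_rel
        unlabeled = [rid for rid in records if rid not in labels]
        best = max(unlabeled, key=lambda rid: score(records[rid], rel_counts, irr_counts, n_rel, n_irr))
        labels[best] = oracle(best)                      # the human decision
        order.append(best)
    return order, labels


records = {
    "a": "randomised trial of statin therapy in adults",
    "b": "statin therapy randomised controlled trial outcomes",
    "c": "qualitative interviews with nurses about workload",
    "d": "survey of nurses workload and burnout",
    "e": "randomised trial of statin dose in older adults",
    "f": "editorial on hospital workload policy",
}
truth = {"a": True, "b": True, "c": False, "d": False, "e": True, "f": False}

order, labels = active_learning_order(records, truth.get, seed_ids=["a", "c"])
print(order)  # the two unlabelled relevant records are ranked ahead of the irrelevant ones
```

Because the counts are rebuilt from every label on each pass, a label that later turns out to be wrong under amended criteria keeps influencing the ranking until it is corrected, which is the practical reason for re-checking earlier decisions after an amendment.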

“AI can’t decide which problems are worth solving.” Lisa Su, AMD chair and chief executive, speaking to MIT graduates in reporting published 11 June 2026.

The quotation is broader than evidence synthesis, yet it captures ASReview’s operating boundary. The software can optimise the order of records against labels, but it cannot decide whether the protocol asks the right causal question, whether an outcome is clinically meaningful, or whether a borderline study should reshape the framework.

Stopping rules and hidden costs

ASReview’s website says active learning can reduce screening workload by 95% and reports more than 638,000 installations. Those are project-reported figures, not a guarantee for a particular prevalence, field, language, or search strategy. Workload reduction depends on how quickly relevant studies cluster, the quality of seed labels, the selected model, and the stopping criterion. Reviews with very low prevalence or heterogeneous terminology can be harder because a relevant study may sit deep in the ranked tail.

There is no licence fee for the open-source software, but there are real costs: Python support, version pinning, validation, server hosting, security, user administration, backup, and methods expertise. Heavier or multilingual models may improve fit but increase compute time. Parallel screening also needs careful design, because labels arriving from several reviewers can change the ranking while others are working. Freeze model versions and random seeds for simulations, export project state, and record the exact stopping rule. Researchers who use general discovery engines to seed ASReview should also understand how Perplexity accuracy is measured, because citation retrieval quality and screening recall are separate problems.

Pricing and Technical Integration Matrix

The matrix below consolidates visible commercial plans and technical limits. It is deliberately specific because “freemium” can conceal the limits that matter most: active reviews, daily screens, report counts, full-text quotas, reviewer allowances, API access, and export formats. Annualised per-month prices require annual payment where stated.

| Plan | Current price | Limits and hidden caps | Integration and entitlement notes |
|---|---|---|---|
| Rayyan Free | $0 | 3 active reviews; 2 free reviewers; limited mobile; duplicate detection and AI relevance included | Common citation imports/exports; no enterprise API entitlement shown |
| Rayyan Essential annual | $4.99 per seat/month | Annual billing; individual capacity above Free, exact active-review cap not clearly exposed in captured plan text | Collaboration and review features; confirm ResearchPilot entitlement |
| Rayyan Essential quarterly | $8.33 per seat/month | Quarterly billing | Same tier family; verify current feature schedule at checkout |
| Rayyan Advanced annual | $8.33 per seat/month | 9 active reviews; up to 10 free reviewers; unlimited samples; PICO AI; priority support | AI agents and expanded workflow features |
| Rayyan Advanced quarterly | $13.33 per seat/month | Quarterly billing; same plan family | Verify AI and reviewer allowances at checkout |
| Rayyan Business/Academic | Custom | Minimum 5 licences; annual; 50 organisation-wide reviews; 250 free reviewers | ResearchPilot, organisation controls, institutional support |
| Rayyan Enterprise | Custom | Unlimited reviews and viewers; contract-defined limits | SSO/SAML, API, management console, dedicated support |
| Scholara Free | $0 | Plan card: 300 abstracts/day and 2,500/month. FAQ: about 100 abstracts and 10 PDFs/24h | PubMed, ClinicalTrials.gov, Europe PMC; PICO/MeSH; AI screening |
| Scholara Starter | $19/month | Plan card: 7,500 studies/search, 750 screens/day, 10,000/month. FAQ says about 500/day | CSV export; unlimited active reviews |
| Scholara Pro | $49/month | Unlimited studies/search; 2,500 screens/day. FAQ says about 2,000/day | All exports; protocol and study import; priority queue/support |
| Scholara Team | $99/seat/month | 2,500 abstracts, 60 full texts, and 60 extractions per seat/day. FAQ cites lower figures | Shared workspace and data privacy features |
| Scholara Enterprise | Custom | Custom usage and seats | SSO/SAML; vendor lists SOC 2 and ISO 27001, but request current certificates |
| Covidence Single | $339/year | One review for 12 months; unlimited collaborators | Import files at most 15,000 records and 50 MB each; one file at a time |
| Covidence Package | $907/year | Up to 3 reviews for 12 months; unlimited collaborators | Same review workflow and import constraints |
| Covidence Organisation | Custom | Unlimited reviews, users, collaborators, and support | Administrator visibility, training, institutional management |
| Elicit Basic | $0 | 2 reports/month; limited Research Agent; unlimited search; 2 extraction columns | Zotero import; basic search and summaries |
| Elicit Plus | $7/month billed $84 annually | 4 reports/month; 5 columns; clinical trials search | RIS, CSV, BIB, PDF, and DOCX exports |
| Elicit Pro | $29/user/month billed $348 annually | Up to 5,000 papers; 144 reports/year; 20 columns; 135 sources/report; 10 alerts | Custom extraction, explanations, API access |
| Elicit Scale | $49/user/month billed $588 annually | 240 reports/year; 200 sources/report; 30 columns | Figure extraction, collaboration, administration |
| Elicit Enterprise | Custom | Up to 40,000 papers and 40 columns; contract-defined capacity | SSO/SAML, 2FA, domain verification, single tenancy, custom data, unlimited Search API |
| ASReview LAB | Open-source | No commercial screen cap; performance depends on hardware, model, corpus, and deployment | Python 3.10+; local app, server, Docker, custom models and extensions |

Pricing checked 16 June 2026. Rayyan and Elicit display billing variants; Scholara displays conflicting caps in plan cards and FAQ copy. Contract and checkout terms should control purchasing decisions.

Two bottlenecks deserve more attention than feature lists. First, export fidelity can fail quietly: special characters, multiline extraction fields, duplicate identifiers, and linked reports may not survive a CSV round trip. Test exports before the review becomes large. Second, APIs do not automatically create reproducibility. A scripted integration should store request parameters, response versions, timestamps, model identifiers where available, and error logs. The site’s overview of data analysis tools for research can help with the downstream stack, but the evidence table should remain traceable back to each study report.
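The logging requirement for scripted integrations can be met with one structured record per call. This is a sketch under stated assumptions: the endpoint name, parameter names, and version strings are placeholders rather than any vendor's real API, and the response is stored as a hash so that unpublished text does not end up in the log.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_entry(endpoint, params, response_body, response_version=None, model_id=None, error=None):
    """Build one log record for an API call made during the review."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "endpoint": endpoint,                    # placeholder name, not a real route
        "params": params,
        "response_version": response_version,
        "model_id": model_id,                    # only where the vendor exposes it
        "response_sha256": hashlib.sha256(response_body.encode("utf-8")).hexdigest(),
        "error": error,
    }


def jsonl_line(entry):
    """Serialise an entry as one JSON Lines row with a stable key order."""
    return json.dumps(entry, sort_keys=True, ensure_ascii=False)


entry = audit_entry(
    endpoint="hypothetical/screening/export",
    params={"review_id": "demo-review", "format": "ris"},
    response_body="TY  - JOUR\nER  - \n",
    response_version="unknown",
)
print(jsonl_line(entry))
```

Appending these lines to a file that is archived with the project gives a per-call trail that can be reconciled against the platform's own audit log.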

Workflow Checklist From PICO to Meta-Analysis

AI tools for systematic review selection by stage

A defensible workflow maps each task to the tool that reduces labour without hiding a critical judgement. The table provides a practical allocation. It assumes a review team has a registered or timestamped protocol, a reproducible database search, and at least two people available for high-impact decisions.

| Stage | Required output | Best-fit tool options | Human validation gate |
|---|---|---|---|
| 1. Question and protocol | Define PICO/PICOS, outcomes, study designs, exclusions, synthesis plan, and AI policy | Scholara for PICO/MeSH drafting; Elicit for terminology exploration | Human methods lead approves every criterion and protocol amendment |
| 2. Search strategy | Build database-specific syntax, peer review search, save dates and exact strings | Scholara previews; Elicit or Perplexity for seed concepts only | Librarian validates syntax; never replace database searches with a chatbot search |
| 3. Import and deduplication | Archive raw exports, import source by source, deduplicate, link multiple reports | Rayyan or Covidence; Scholara unified records | Audit random duplicate pairs and false merges before screening |
| 4. Title and abstract screening | Pilot criteria, blind screen, resolve conflicts, then enable prioritisation | Rayyan, Covidence, Scholara, or ASReview | Measure reviewer agreement; retain manual authority over exclusions |
| 5. Full-text assessment | Retrieve PDFs, record exclusion reason at study level, link reports | Covidence, Rayyan, or Scholara | Human reviewers inspect full text; AI can suggest but not silently exclude |
| 6. Data extraction and RoB | Create typed fields, pilot on diverse studies, extract with provenance, assess bias | Elicit, Scholara, Rayyan ResearchPilot, or Covidence suggestions | Double-check critical outcomes, denominators, effect directions, and RoB judgements |
| 7. Synthesis and meta-analysis | Validate analysis-ready dataset, transform effects, inspect heterogeneity, run sensitivity analyses | Scholara forest plots for integrated work; R or specialist software for independent analysis | Statistician verifies model, variance, zero-event handling, subgroup logic, and interpretation |
| 8. Reporting and update | Export logs, complete PRISMA flow, disclose AI use, archive versions, plan update search | Rayyan, Covidence, and Scholara reporting; Elicit alerts; ASReview project export | Methods section names tools, versions, dates, tasks, oversight, validation, and deviations |

Workflow design synthesises PRISMA 2020, the 2025 joint AI position statement, official vendor documentation, and current evidence-synthesis practice.

Implementation should begin with a pilot, not a full import. Select 100 to 300 records that include obvious inclusions, obvious exclusions, and ambiguous cases. Have reviewers apply the criteria independently, reconcile differences, and refine operational definitions without changing the scientific question. Only then should the team configure labels, exclusion reasons, extraction schemas, and AI features.
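Agreement in the pilot is commonly summarised with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain implementation for two reviewers. The ten pilot decisions are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers' decisions on the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of records")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same category independently.
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0                       # both reviewers used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical pilot: I = include, E = exclude.
reviewer_1 = ["I", "I", "I", "E", "E", "E", "E", "E", "E", "E"]
reviewer_2 = ["I", "I", "E", "E", "E", "E", "E", "E", "E", "I"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"raw agreement 0.80, kappa {kappa:.2f}")   # kappa is about 0.52
```

The gap between 80% raw agreement and a kappa near 0.52 is typical when most records are exclusions, which is why the pilot should include ambiguous cases rather than only easy ones.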

Next, create a validation set that the model does not train on. For active learning, a random sample of the unscreened tail is more informative than checking only high-ranked records. For extraction, sample by document difficulty: scanned PDFs, tables, appendices, subgroup reports, and papers with multiple time points. For meta-analysis, independently recompute a subset of effect sizes and verify that direction, units, and variance agree.
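For binary outcomes, the independent recomputation can be as small as the function below. It uses the standard log odds ratio and its large-sample variance from a two-by-two table. The 0.5 continuity correction for zero cells is one common convention and is an assumption here, since the review protocol may specify a different method. The trial counts are hypothetical.

```python
import math


def log_odds_ratio(events_t, total_t, events_c, total_c, correction=0.5):
    """Log odds ratio (treatment vs control) and its variance from arm-level counts."""
    a, b = events_t, total_t - events_t
    c, d = events_c, total_c - events_c
    if min(a, b, c, d) < 0:
        raise ValueError("events cannot exceed the arm total")
    if min(a, b, c, d) == 0:
        # Assumed convention: add the correction to every cell when any cell is zero.
        a, b, c, d = (x + correction for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def agrees(recomputed, extracted, tolerance=0.01):
    """True when an extracted value matches the recomputed one within tolerance."""
    return abs(recomputed - extracted) <= tolerance


# Hypothetical trial: 10/100 events on treatment, 20/100 on control.
log_or, variance = log_odds_ratio(10, 100, 20, 100)
print(round(log_or, 3), round(variance, 3))   # -0.811 0.174
print(agrees(log_or, 0.811))                  # False: the sign is flipped in the extracted value
```

A sign check like the last line catches the most damaging extraction error, where arms are swapped and a protective effect is recorded as harmful.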

Finally, freeze the evidence hand-off. Assign stable study IDs, preserve source identifiers, link all reports belonging to one study, and export a data dictionary. A PICO framework is useful for eligibility, but extraction needs a more granular schema for population variants, intervention dose, comparator definition, outcome instrument, follow-up, analysis population, and risk-of-bias domain. AI is most dependable when each field has a precise type and an explicit “not reported” state.
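One way to make the explicit "not reported" state concrete is a typed record that refuses inconsistent combinations. The field names and the three-state status below are illustrative assumptions, not a schema taken from any of the five tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FieldStatus(Enum):
    REPORTED = "reported"
    NOT_REPORTED = "not reported"
    UNCLEAR = "unclear"


@dataclass(frozen=True)
class ExtractedNumber:
    study_id: str
    report_id: str                        # which report of the study supplied the value
    field_name: str
    status: FieldStatus
    value: Optional[float] = None
    unit: Optional[str] = None
    source_quote: Optional[str] = None    # provenance shown to the verifying reviewer
    verified_by: Optional[str] = None

    def __post_init__(self):
        if self.status is FieldStatus.REPORTED:
            if self.value is None or not self.source_quote:
                raise ValueError("a reported value needs both a number and a source quote")
        elif self.value is not None:
            raise ValueError("a value cannot accompany a 'not reported' or 'unclear' status")


randomised = ExtractedNumber(
    study_id="S1", report_id="S1-R1", field_name="n_randomised",
    status=FieldStatus.REPORTED, value=200.0, unit="participants",
    source_quote="200 participants were randomised",
)
missing = ExtractedNumber("S1", "S1-R1", "dropouts", FieldStatus.NOT_REPORTED)
print(randomised.status.value, missing.status.value)
```

Separating an absent value from a zero or an empty string prevents a model's blank output from being read as a measured result during synthesis.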

How to Keep PRISMA Compliance When AI Screens Studies

PRISMA compliance is not a button or a generated flow diagram. It is the traceable relationship between a protocol, searches, records, decisions, studies, analyses, and reported conclusions. AI tools for systematic review can support that chain, but they can also introduce undocumented steps. A generated diagram is only accurate when imports, deduplication, exclusions, and linked reports have been recorded correctly.

Document the automation intervention

The methods section should name the product, feature, version or access date, task, input, output, and human role. For screening, state whether AI sorted records, suggested decisions, acted as an additional reviewer, or automatically excluded anything. Report when the feature was enabled, how many human-labelled records existed at that point, whether rankings changed during screening, and who resolved conflicts. For extraction, describe the schema, whether source quotations were shown, which fields required double checking, and how disagreements were handled.

A useful AI disclosure includes the stopping rule and its validation. “Screening stopped when no relevant studies appeared in the previous 100 records” is not enough unless the rule was pre-specified and tested. Report any random sample of excluded records, recall estimate, confidence interval, or comparison with a known inclusion set. Where no validated stopping rule was used, say that all records received a human eligibility decision.
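A recall estimate against a known inclusion set, with its confidence interval, can be computed without any special library. The following sketch uses the Wilson score interval, which is one common choice for proportions and is my assumption here rather than a method prescribed by PRISMA or any vendor; the function names and identifiers are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def recall_against_known_set(found_ids, known_inclusions):
    """Recall of the screening output against a set of known relevant records."""
    known = set(known_inclusions)
    hits = len(known & set(found_ids))
    low, high = wilson_interval(hits, len(known))
    return {"hits": hits, "known": len(known),
            "recall": hits / len(known), "ci_low": low, "ci_high": high}


# Hypothetical check: 50 known inclusions, 48 of them surfaced before stopping.
known = [f"k{i}" for i in range(50)]
report = recall_against_known_set(known[:48] + ["other1", "other2"], known)
```

With a small known set the interval is wide, which is itself a useful thing to disclose: a point estimate of 96% recall on 50 studies does not rule out recall well below 90%.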

Preserve the record-to-study distinction

PRISMA 2020 distinguishes records, reports, and studies. AI deduplication frequently works at the citation level, while inclusion and synthesis operate at the study level. One clinical trial may have a registry entry, abstract, primary report, protocol, follow-up, and secondary analysis. If software treats those as independent studies, extraction and meta-analysis can double count participants. If it merges them too aggressively, reviewers may lose outcome detail. Maintain a study-family table that links every report to one study ID and records which report supports each extracted value.
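A study-family table can be as simple as a mapping from study IDs to report IDs. The sketch below, with hypothetical identifiers and sample sizes, shows how such a table prevents participants from being counted once per report rather than once per study.

```python
def link_reports(study_family):
    """Invert {study_id: [report_id, ...]} into {report_id: study_id}."""
    index = {}
    for study_id, report_ids in study_family.items():
        for report_id in report_ids:
            if index.get(report_id, study_id) != study_id:
                raise ValueError(f"{report_id} is linked to two studies")
            index[report_id] = study_id
    return index


def participants_by_study(sample_sizes, index):
    """Count each study once. sample_sizes maps report_id -> participants randomised.

    Returns ({study_id: n}, conflicts); conflicts lists reports whose n differs
    from the first value seen for that study, so a reviewer can adjudicate.
    """
    by_study, conflicts = {}, []
    for report_id, n in sample_sizes.items():
        study_id = index[report_id]
        if study_id in by_study and by_study[study_id] != n:
            conflicts.append((study_id, report_id, n))
        by_study.setdefault(study_id, n)
    return by_study, conflicts


# Hypothetical data: one trial with a primary report and a follow-up, plus a second trial.
family = {"S001": ["R-primary", "R-followup"], "S002": ["R-other"]}
sizes = {"R-primary": 240, "R-followup": 240, "R-other": 180}
by_study, conflicts = participants_by_study(sizes, link_reports(family))
naive_total = sum(sizes.values())        # counts S001 twice
linked_total = sum(by_study.values())    # counts each study once
```

Conflicts are returned rather than raised because differing counts across reports can be legitimate, for example after attrition at follow-up, and need a human decision about which report supports the extracted value.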

The 2025 joint position statement also requires transparent reporting whenever AI makes or suggests judgements. That includes generated exclusion reasons, risk-of-bias proposals, and automated classifications. Keep raw model output where feasible, but do not flood the manuscript with logs. Archive the detailed audit trail in a repository or project record, then report the decision logic and validation succinctly. Journal policies are changing, so teams should check the target journal’s AI disclosure rules before submission.

Human Validation, Benchmarks, and Stopping Rules

Why headline accuracy is insufficient

The dominant benchmark mistake is to report a pooled mean that hides review-level variance. A screening system can achieve excellent average recall while failing on one review with uncommon terminology, non-English abstracts, older indexing, or a rare study design. Covidence’s withheld model demonstrates this directly: 98.6% average recall concealed a minimum of 65.7% on a single review. The correct evaluation unit is often the review, not the record, because the harm arises when a review misses evidence.
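The gap between a pooled figure and the worst review is easy to make visible in a benchmark report. The sketch below uses invented counts, not Covidence's data, and the function name is hypothetical.

```python
def review_level_recall(results):
    """Summarise recall per review. results maps review_id -> (found, total_relevant)."""
    per_review = {rid: found / total
                  for rid, (found, total) in results.items() if total > 0}
    if not per_review:
        raise ValueError("no review has any relevant records")
    worst = min(per_review, key=per_review.get)
    return {
        # Record-level pooling: large reviews dominate the figure.
        "pooled": sum(f for f, t in results.values()) / sum(t for f, t in results.values()),
        # Review-level mean: each review counts equally.
        "mean": sum(per_review.values()) / len(per_review),
        "minimum": per_review[worst],
        "worst_review": worst,
    }


# Invented benchmark of three reviews; the third uses uncommon terminology.
summary = review_level_recall({"rev_A": (99, 100), "rev_B": (50, 50), "rev_C": (23, 35)})
```

Reporting the minimum and the identity of the worst review alongside the average lets readers judge whether their own review resembles the failure case.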

A second mistake is leakage. If a model has seen the same review, article, or duplicated record during development, benchmark performance may overstate real-world generalisation. A June 2026 SciConBench preprint evaluated eight frontier models and research agents on scientific conclusion synthesis and reported a best clean-room factual F1 of 0.337. The result is preliminary and concerns synthesis rather than screening, but it reinforces the need for controlled retrieval and independent test sets. The 2026 systematic review of automated meta-analysis (Li et al., 2026) similarly found that only one of 54 included studies explored preliminary full-process automation, while most work focused on data processing.

A validation matrix for real reviews

| Use case | Primary failure | Minimum evaluation | Human control |
| --- | --- | --- | --- |
| Relevance ranking only | Missed prioritisation, not missed eligibility if all records are screened | Time to first inclusion, ranking gain, workload curve | Reviewers still decide every record |
| AI as second screener | Systematic disagreement or automation bias | Recall against dual-human consensus; review-level minimum; conflict profile | Human adjudicates every AI-human disagreement |
| Automatic exclusion | Relevant study never reaches a reviewer | Very high recall with confidence intervals across diverse reviews; tail audit | Use only with validated threshold and explicit governance, otherwise avoid |
| AI-assisted extraction | Wrong arm, time point, denominator, unit, or effect direction | Field-level precision/recall; critical-field error rate; provenance coverage | Human accepts or rejects each critical value with source passage visible |
| Automated meta-analysis | Wrong effect transformation, model choice, variance, or subgroup logic | Recomputed effect sizes; agreement with independent script; sensitivity checks | Statistician verifies data and analysis code before interpretation |

The validation thresholds should be set by error impact, human oversight, and domain risk. No single percentage is safe for every review.
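For the extraction row of the matrix, a critical-field error rate can be computed by comparing AI proposals with human-verified values. This is a minimal sketch under simple assumptions: exact-match comparison, hypothetical field names, and a human-verified table treated as the reference.

```python
def field_error_rates(ai_rows, human_rows, critical_fields):
    """Compare AI-proposed values with human-verified ones.

    Both inputs map study_id -> {field: value}. Returns
    ({field: (errors, compared)}, critical_error_rate or None).
    A field missing from the AI output counts as an error.
    """
    per_field = {}
    for study_id, truth in human_rows.items():
        proposed = ai_rows.get(study_id, {})
        for field, value in truth.items():
            errors, compared = per_field.get(field, (0, 0))
            wrong = 1 if proposed.get(field) != value else 0
            per_field[field] = (errors + wrong, compared + 1)
    crit = [per_field[f] for f in critical_fields if f in per_field]
    crit_total = sum(compared for _, compared in crit)
    rate = sum(errors for errors, _ in crit) / crit_total if crit_total else None
    return per_field, rate


# Hypothetical two-study sample; the AI picked the wrong denominator for S002.
human = {"S001": {"events": 15, "denominator": 100, "country": "NL"},
         "S002": {"events": 30, "denominator": 98, "country": "PK"}}
ai = {"S001": {"events": 15, "denominator": 100, "country": "NL"},
      "S002": {"events": 30, "denominator": 100, "country": "PK"}}
per_field, critical_rate = field_error_rates(ai, human, ["events", "denominator"])
```

Separating critical fields from descriptive ones keeps a high overall agreement figure from masking the errors that would change an effect estimate.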

A good stopping rule combines several signals: a pre-specified number of consecutive exclusions, a model-based estimate of remaining inclusions, a random audit of low-ranked records, and sensitivity to different model settings. Simulation on a completed review can help choose the rule, but it cannot guarantee performance on a new domain. Where possible, fix the rule in the protocol before looking at the final result, then report it as specified together with any deviation.
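A combined rule of this kind can be written down as an explicit check so that the protocol and the screening log refer to the same logic. The sketch below is illustrative only: the class, the thresholds, and the setting names are hypothetical, and the model estimates are assumed to be supplied by whatever screening tool is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoppingRule:
    consecutive_exclusions: int     # pre-specified run length
    max_estimated_remaining: float  # tolerated model estimate of inclusions left
    audit_sample_size: int          # minimum size of the random tail audit
    max_audit_inclusions: int = 0   # inclusions tolerated in the audit


def may_stop(rule, decisions, estimates_by_setting, audit_decisions):
    """Evaluate every signal; stopping needs all of them to pass.

    decisions: booleans in screening order, True = included.
    estimates_by_setting: {model setting: estimated remaining inclusions}.
    audit_decisions: human decisions on the random low-ranked sample.
    """
    run = 0
    for included in reversed(decisions):
        if included:
            break
        run += 1
    checks = {
        "consecutive_exclusions": run >= rule.consecutive_exclusions,
        # Sensitivity: the estimate must hold under every model setting tried.
        "model_estimates": bool(estimates_by_setting) and all(
            est <= rule.max_estimated_remaining for est in estimates_by_setting.values()),
        "tail_audit": (len(audit_decisions) >= rule.audit_sample_size
                       and sum(audit_decisions) <= rule.max_audit_inclusions),
    }
    return all(checks.values()), checks


rule = StoppingRule(consecutive_exclusions=100, max_estimated_remaining=1.0,
                    audit_sample_size=200)
stop, checks = may_stop(rule, [True, False, True] + [False] * 120,
                        {"setting_a": 0.4, "setting_b": 0.9}, [False] * 200)
```

Returning the individual checks alongside the overall answer gives the audit trail a record of which signal blocked or permitted stopping.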

Human validation also needs independence. A reviewer who sees an AI rationale before making a decision may anchor on it. For a validation sample, conceal the AI output until the human decision is recorded. For extraction, have one reviewer verify against the PDF rather than against the model’s wording. For synthesis, regenerate key calculations from the raw fields. These controls are slower than accepting outputs, but they concentrate effort where errors can change conclusions.
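Regenerating a calculation from raw fields can be as small as recomputing one effect size. The sketch below recomputes a log risk ratio and its standard error from event counts using the standard large-sample formula, then compares the result with the value a tool extracted. The function names and the tolerance are hypothetical, and zero-event arms are deliberately not handled.

```python
import math


def log_risk_ratio(events_t, n_t, events_c, n_c):
    """Log risk ratio and its standard error from raw 2x2 counts."""
    if min(events_t, events_c) <= 0:
        # Zero cells need a continuity correction, which this sketch leaves out.
        raise ValueError("zero-event arms are not handled here")
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return log_rr, se


def check_extracted_effect(extracted_log_rr, extracted_se, raw_counts, tol=0.01):
    """Compare a tool-supplied effect with one recomputed from raw counts."""
    log_rr, se = log_risk_ratio(*raw_counts)
    return {
        "direction": (extracted_log_rr > 0) == (log_rr > 0),
        "magnitude": abs(extracted_log_rr - log_rr) <= tol,
        "standard_error": abs(extracted_se - se) <= tol,
    }


# Hypothetical trial: 15/100 events on treatment versus 30/100 on control.
result = check_extracted_effect(-0.69, 0.28, (15, 100, 30, 100))
```

A sign flip is the most consequential failure and the cheapest to catch, which is why direction is checked separately from magnitude.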

Which Tool Should You Choose?

Choose by bottleneck, not by feature count

Choose Rayyan when the review already has a search strategy and the largest problem is collaborative screening, conflict management, duplicate handling, and reviewer throughput. Its free tier is credible for smaller work, while advanced AI and organisational features require careful plan checking. It is the most natural upgrade from spreadsheets for teams that want to preserve human screening decisions.

Choose Scholara when the team values one connected workspace from PICO construction to forest plots and accepts a newer platform with commercial and documentation details that require procurement verification. It is the strongest candidate for reducing hand-offs, but its breadth makes independent validation essential. Resolve the current plan-limit contradictions and request security evidence before adopting it for confidential work.

Choose Covidence when governance, established team roles, institutional administration, unlimited collaborators, and methods-centred workflows matter more than experimental autonomy. It costs more for a single review than the entry tiers of other products, but its 2026 AI decision framework is a meaningful strength. Large health-review teams may value the product’s restraint as much as its automation.

Choose Elicit when the problem is finding literature, understanding a field, creating evidence tables, tracing claims to source passages, and accelerating structured extraction. It is particularly useful before formal database screening and during extraction design. Do not treat its search corpus as a replacement for bibliographic databases, and confirm the pricing variant, paper limits, report quotas, and API terms that apply to your account.

Choose ASReview when budget, transparency, local control, experimentation, or reproducible active-learning research matters, and the team can support Python and validation. It offers the greatest methodological freedom and the least turnkey governance. The open-source licence removes subscription cost, not the need for technical ownership.

For many reviews, the best answer is a controlled combination. Elicit can help discover terminology and prototype extraction fields; Rayyan or Covidence can manage blinded eligibility; ASReview can test prioritisation; Scholara can provide an integrated alternative; and R can independently reproduce the final analysis. The decision should be documented in the protocol as a map of tasks, controls, and hand-offs, not as a declaration that one product “did the systematic review”.

Takeaways

  • Rayyan is the strongest general choice for collaborative screening, but advanced AI, reviewer allowances, API access, and ResearchPilot entitlement depend on plan or institutional contract.
  • Scholara offers the broadest documented pipeline, including PICO, triple-agent screening, extraction, RoB 2, meta-analysis, and forest plots, yet its public plan limits currently conflict and should be contractually confirmed.
  • Covidence provides the clearest 2026 example of responsible restraint: it withheld an automatic exclusion feature despite 98.6% average recall because one review fell to 65.7%.
  • Elicit is most valuable when every summary or extracted value remains tied to source evidence. Its current public pricing page shows several audience variants, so displayed prices are not universally interchangeable.
  • ASReview is open-source and technically flexible, but workload reduction depends on prevalence, labels, model choice, stopping rules, and domain fit. Python operation and validation are real costs.
  • PRISMA compliance requires documenting the precise automation intervention, preserving record-report-study relationships, reporting stopping rules, and retaining an audit trail.
  • Validate at the review level, not only the record level. Report minima, confidence intervals, difficult document types, and tail audits rather than a single pooled accuracy number.
  • Keep humans responsible for final eligibility, critical extraction fields, risk-of-bias judgement, effect-size verification, and interpretation. AI should structure and prioritise evidence, not silently determine it.

Conclusion

The market for AI tools for systematic review has moved beyond simple relevance ranking. Rayyan now layers AI agents and extraction support onto a mature screening environment. Scholara aims to connect protocol, search, screening, extraction, risk of bias, and meta-analysis. Covidence is formalising a risk-based approach to deciding which automation should be released. Elicit is turning literature discovery and extraction into smaller, source-linked operations. ASReview continues to give researchers an open, testable active-learning stack.

The unresolved question is not whether these systems save time. They often will. The harder question is whether the saved time can be converted into better search design, more careful adjudication, stronger extraction checks, and more transparent reporting. Current evidence does not justify unattended full-text exclusion or autonomous scientific synthesis. Review-level variability, data leakage, document complexity, and automation bias remain material risks.

The most defensible 2026 workflow therefore treats AI as controlled infrastructure. Each automated step has a defined input, output, error consequence, validation method, and human owner. When that architecture is explicit, software can accelerate evidence synthesis without weakening the chain of reasoning that makes a systematic review trustworthy.

FAQs

What is the best AI tool for systematic review screening?

Rayyan is the best all-round choice for collaborative screening, while ASReview is strongest for open-source active learning. Covidence suits governance-heavy teams, and Scholara adds triple-agent screening inside a broader end-to-end workflow. The best choice depends on whether every record will receive a human decision or a validated stopping rule will reduce workload.

Can AI complete a systematic review automatically?

No current tool should be trusted to complete a rigorous review without human oversight. AI can help search, rank, screen, extract, and format analyses, but humans should approve eligibility, verify critical data, assess risk of bias, select statistical methods, and interpret results. Full-process automation remains an open research problem.

Is Rayyan free for systematic reviews?

Rayyan has a free plan with three active reviews and two free reviewers, plus duplicate detection and AI relevance features. Advanced AI agents, larger reviewer allowances, institutional management, SSO, and API access belong to paid or contract tiers. Current entitlements should be checked on the pricing page before a large project begins.

Does ASReview really reduce screening by 95%?

ASReview reports potential workload reductions of up to 95%, but this is not guaranteed. Savings depend on relevant-study prevalence, terminology, initial labels, model choice, review heterogeneity, and the stopping rule. Teams should simulate on completed reviews and audit random low-ranked records before using a reduced-screening design.

How can AI screening remain PRISMA compliant?

Report the tool, feature, version or access date, task, input, output, human role, conflict process, stopping rule, and validation. Preserve raw search exports, record deduplication counts, distinguish records from studies, log full-text exclusion reasons, and retain an audit trail. A generated PRISMA diagram is not sufficient by itself.

Which tool is best for data extraction?

Elicit is strong for source-linked evidence tables and custom extraction. Scholara provides extraction inside an integrated review and meta-analysis workflow. Rayyan ResearchPilot and Covidence also offer extraction assistance. Whichever tool is used, critical values should be checked against the PDF for arm, denominator, time point, unit, and effect direction.

Is Scholara better than Covidence for large teams?

Scholara offers broader end-to-end automation and per-seat Team limits, while Covidence offers a more established team workflow, unlimited collaborators on review packages, and organisation-wide governance. Scholara’s current public usage limits contain contradictions, so large teams should verify caps, security certification, and support terms before comparing total cost.

Can Elicit replace PubMed or Google Scholar?

Elicit can accelerate discovery, semantic search, summaries, and extraction, but it should not replace reproducible database searches for a formal systematic review. Use it to identify terms, seed papers, and evidence patterns, then run documented database-specific searches and preserve exact strategies, dates, and exports.

References

ASReview. (2026). ASReview LAB 3.0.7 installation documentation. https://asreview.readthedocs.io/en/stable/lab/installation.html

Covidence. (2026). Beyond evaluation: Deciding when AI is appropriate in evidence synthesis. https://www.covidence.org/blog/beyond-evaluation-deciding-when-ai-is-appropriate-in-evidence-synthesis/

Elicit. (2025, February 20). Systematic reviews in Elicit. https://elicit.com/blog/systematic-reviews

Li, L., Mathrani, A., & Susnjak, T. (2026). Transforming evidence synthesis: A systematic review of the evolution of automated meta-analysis in the age of AI. Research Synthesis Methods. https://doi.org/10.1017/rsm.2025.10065

Ouzzani, M., Hammady, H., Fedorowicz, Z., & Elmagarmid, A. (2016). Rayyan: A web and mobile app for systematic reviews. Systematic Reviews, 5, 210. https://doi.org/10.1186/s13643-016-0384-4

Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., et al. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://www.bmj.com/content/372/bmj.n71

Stuhlmüller, A. (2026, April 10). What’s going on in AI? Situational awareness, April 2026. Elicit. https://elicit.com/blog/situational-awareness-april-2026

Systematic Review and Evidence Synthesis Infrastructure. (2025, November 11). Position statement on artificial intelligence use in evidence synthesis. https://www.sei.org/publications/position-statement-on-ai-use-in-evidence-synthesis/

van de Schoot, R., de Bruin, J., Schram, R., Zahedi, P., de Boer, J., Weijdema, F., et al. (2021). An open source machine learning framework for efficient and transparent systematic reviews. Nature Machine Intelligence, 3, 125-133. https://doi.org/10.1038/s42256-020-00287-7