How we test
Our methodology
Every Riftgrind review starts with the same process: a paid account, a fixed prompt set, a scoring rubric, and at least two test runs on different days. Here's exactly how it works.
Core principles
Every methodology decision traces back to four rules we don't break:
Hands-on only
We never review a tool we haven't used. Spec sheets, press releases, and marketing pages are starting points, not evidence.
Paid tier by default
Where a paid tier exists, we subscribe to the highest consumer tier so our experience matches what a serious user will see.
Repeated runs
Every test is repeated on at least two different days to smooth out server-side variability and model-load differences.
Fixed rubric
Scoring categories are locked before we start. We don't reshape the rubric to fit a tool we like or dislike.
The testing process
Every review follows the same six steps from first sign-up to publication:
- Sign-up & onboarding — we create a fresh account from an EU IP, record the onboarding flow, and note any dark patterns (forced upsells, pre-ticked marketing consent, unclear pricing).
- Paid upgrade — where a paid tier exists, we subscribe to the highest consumer plan and note the true post-tax price, billing cadence, and refund policy.
- Fixed prompt run — we execute the category's standard prompt pack (see below) and save every output, including failures, rate-limits, and moderation refusals.
- Side-by-side comparison — outputs are placed alongside those of competing tools that ran the same prompts in the same week, so differences are attributable to the tool rather than to prompt drift.
- Repeat run — at least 48 hours later we re-run a subset to check for consistency. Significant drift is flagged in the review.
- Rubric scoring — the reviewing editor scores the tool against the rubric, and a second editor scores it independently. Differences greater than one point are discussed and resolved before publication.
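The reconciliation rule in the last step can be sketched in a few lines of Python. This is a minimal illustration, not a description of Riftgrind's actual tooling: the function name `disputed_axes`, the axis keys, and the sample scores are hypothetical, and the sketch assumes the one-point threshold is applied per scoring axis.

```python
from typing import Dict, List


def disputed_axes(
    first: Dict[str, float],
    second: Dict[str, float],
    threshold: float = 1.0,
) -> List[str]:
    """Return the axes where two editors differ by more than `threshold` points."""
    if first.keys() != second.keys():
        raise ValueError("both editors must score the same axes")
    return sorted(
        axis for axis in first if abs(first[axis] - second[axis]) > threshold
    )


# Hypothetical scores, for illustration only.
editor_a = {"output_quality": 8, "reliability": 7, "ease_of_use": 9}
editor_b = {"output_quality": 6, "reliability": 7, "ease_of_use": 8}

print(disputed_axes(editor_a, editor_b))  # ['output_quality']
```

A gap of exactly one point (as on `ease_of_use` here) is not flagged, because the rule speaks of differences greater than one point.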
Review evidence policy
Review claims must come from saved test outputs, account screenshots, pricing pages seen during testing, product documentation, or repeatable in-product checks. If a value is not available in the review data, we omit it or use neutral language such as "check current pricing" instead of guessing.
We do not create scores, prices, testing dates, author names, screenshots, pros, cons, or affiliate status from memory. Those fields need to exist in the review dataset before they appear in comparison cards or schema.
Our scoring rubric
Every tool is scored from 1 to 10 on six axes. The headline score is a weighted average rather than a simple mean: output quality and reliability together carry half the weight, so they count for more than brand or UI polish.
| Scoring area | Weight | Evidence we look for |
|---|---|---|
| Output quality | 30% | Representative prompts, saved outputs, failed generations, and side-by-side comparisons. |
| Reliability | 20% | Repeat runs, latency, failures, moderation behavior, and consistency over time. |
| Features and control | 15% | Editing controls, references, exports, seeds, revisions, and workflow fit. |
| Pricing transparency | 15% | Plan limits, credits, renewal, cancellation, paid gates, and hidden usage costs. |
| Ease of use | 10% | Onboarding, first useful result, navigation, and clarity for new users. |
| Safety and trust | 10% | Data controls, watermarking, provenance, age gates, and content policy clarity. |
Output quality (30%)
How good is what it produces on representative real-world prompts?
Reliability (20%)
Does it fail, stall, or produce broken output under normal load?
Features & control (15%)
Can you steer the output, iterate, and export in the formats you actually need?
Pricing transparency (15%)
Is pricing clear, fair, and free of hidden credit burns or dark patterns?
Ease of use (10%)
How quickly can a new user reach a usable result?
Safety & trust (10%)
Are watermarking, content provenance, data retention, and moderation behavior clear and well handled?
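To make the weighted average concrete, here is a minimal Python sketch that combines six axis scores using the weights from the rubric table. The dictionary keys, the function name `headline_score`, the rounding to one decimal place, and the sample scores are assumptions made for illustration; only the weights and the 1–10 range come from the rubric.

```python
from typing import Mapping

# Weights from the rubric table; the key names are hypothetical identifiers.
WEIGHTS = {
    "output_quality": 0.30,
    "reliability": 0.20,
    "features_control": 0.15,
    "pricing_transparency": 0.15,
    "ease_of_use": 0.10,
    "safety_trust": 0.10,
}


def headline_score(axis_scores: Mapping[str, float]) -> float:
    """Weighted average of the six axis scores, rounded to one decimal (assumed)."""
    if set(axis_scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six rubric axes")
    for axis, score in axis_scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{axis} score {score} is outside the 1-10 range")
    total = sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)
    return round(total, 1)


# Hypothetical scores, for illustration only.
example = {
    "output_quality": 8,
    "reliability": 7,
    "features_control": 6,
    "pricing_transparency": 8,
    "ease_of_use": 9,
    "safety_trust": 5,
}

print(headline_score(example))  # 7.3
```

For these sample scores the simple mean would be about 7.2, so the weighting nudges the headline score toward the axes that carry the most weight.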
Category-specific tests
Each of our three publications uses a category-tailored prompt pack. The packs are fixed within a quarter so that tools reviewed in the same window are directly comparable.
- AI Video (AiVidLab) — 10 scenes spanning cinematic shots, character animation, product renders, and motion graphics. Fixed seed where supported. We measure prompt adherence, motion coherence, temporal consistency, resolution, watermarking, and export formats.
- AI Companion (CompanionCompass) — scripted 20-turn conversations that probe memory, personality consistency, emotional tone, safety refusals, and roleplay boundaries. Each platform is tested in both free and paid modes.
- AI Image & Creator (AI Forge Lab) — a 12-prompt suite covering photorealism, illustration, typography, and specific creator use-cases. Seed-matched where possible. Output is scored blind by two editors before the product name is revealed.
Re-testing cadence
AI tools change fast, and a review that was accurate last month can be outdated by next Tuesday. To keep our rankings honest:
- Every published review is re-tested at least once every 90 days.
- We re-test immediately whenever a platform announces a new model, a pricing overhaul, or a major feature.
- Every review displays a visible "last tested" timestamp so you can tell at a glance whether what you are reading still applies.
- If a tool regresses, we lower its score and explain what changed. If it improves, we raise the score with the same transparency.
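The first two rules above amount to a simple check that can be sketched in Python. This is an illustration under stated assumptions rather than our actual scheduling code: the function name `retest_due`, the `major_change` flag (standing in for a new model, a pricing overhaul, or a major feature), and the sample dates are all hypothetical.

```python
from datetime import date, timedelta

# Maximum time between tests, taken from the 90-day rule above.
RETEST_INTERVAL = timedelta(days=90)


def retest_due(last_tested: date, today: date, major_change: bool = False) -> bool:
    """Return True if a review should be re-tested now.

    A re-test is due when 90 days have passed since the last test,
    or immediately when a major platform change has been announced.
    """
    return major_change or (today - last_tested) >= RETEST_INTERVAL


# Hypothetical dates, for illustration only.
last = date(2024, 1, 1)

print(retest_due(last, last + timedelta(days=30)))                     # False
print(retest_due(last, last + timedelta(days=90)))                     # True
print(retest_due(last, last + timedelta(days=30), major_change=True))  # True
```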
Conflicts of interest
Riftgrind is funded entirely by affiliate commissions. No vendor can buy a ranking, a recommendation, or a placement in a list. We accept free credits and early access for testing purposes only — these do not influence scores, and we disclose them in the review body when relevant.
Affiliate relationships are disclosed near monetized calls to action and summarized on the affiliate disclosure page. Commercial partnerships do not decide which tools we test, how we score them, or whether criticism remains in a review.
If an editor has a prior relationship with a vendor (previous employment, equity, advisory role), they recuse themselves from that tool's review and we assign a different editor.
Corrections and updates
When we get something wrong, we fix it and say so. Factual corrections are noted at the bottom of the review with a date. Score changes that result from re-testing are recorded in a visible change log on the review page. Spelling and formatting fixes are made silently.
Spotted something that looks off? Please email us via the contact page with "Correction" in the subject line.
Limitations we're honest about
No testing methodology is perfect. A few limitations we want you to know about:
- AI outputs are stochastic. Even with fixed seeds, some tools produce slightly different output across runs. We mitigate this with repeated testing, but we can't eliminate it.
- We test from the EU. Some tools geo-restrict features or apply stricter moderation in Europe; we note this where we see it.
- Our sample of prompts is representative, not exhaustive. A tool that scores poorly on our suite may still be great at a specific niche workflow we don't cover.
- Pricing and features can change between testing and publication. We flag anything we suspect is in flux.