How AI Arenduskeskus evaluates whether advanced reasoning models can support high-stakes enterprise AI implementation decisions, in environments where a fluent model answer is not the same as a safe one.
Most enterprise AI evaluation work measures whether a model produces fluent, confident answers. Our focus is narrower: dirty operational-data migration and implementation decisions where a plausible answer can still be unsafe.
We evaluate the gap between fluent output and defensible reasoning under realistic enterprise conditions: messy operational data, partial documentation, conflicting source systems, and governance constraints that change which model providers are even eligible.
The methodology described here supports AI Arenduskeskus's commercial Discovery Sprint offering for European business clients: deciding whether to build, pilot, route, or reject an AI implementation before major development spending.
AI Arenduskeskus is an applied AI evaluation company in the European Union. We assess whether advanced reasoning models can be safely used in specific enterprise AI implementation decisions, with a particular focus on dirty operational-data migration.
The methodology focuses on a narrow question that standard model benchmarks and generic advisory work often leave unresolved: whether a specific reasoning capability is safe enough for a specific enterprise implementation decision, against a specific dataset, under specific governance constraints.
The methodology is developed and stress-tested against a private benchmark we maintain internally: synthetic and anonymised data designed to mirror the structural failure modes of real client systems without exposing any client information. The methodology is public; the bench is deliberately kept private to prevent benchmark contamination.
For the Gemini 3 Deep Think EAP, the same methodology is used to compare Gemini 3 Deep Think against the Gemini 3.1 Pro high-thinking baseline on private enterprise reasoning cases.
Public benchmarks measure how well a model performs on tasks designed to be evaluable: clean inputs, single correct answers, narrow domains. Frontier reasoning models score very well on these. The question we ask is different: can the same model help an enterprise team decide whether to migrate a customer database, restructure a service catalogue, or commit budget to an AI build — when the input data is incomplete, the source documentation is wrong in places, and the cost of a confident-but-wrong answer is months of wasted engineering?
This is not a model-quality question in the traditional sense. A weaker model that surfaces its uncertainty is more useful here than a stronger model that confidently produces a fluent, plausible, incorrect answer. Our methodology is designed to distinguish those two cases.
The unit of analysis is the implementation decision — build, pilot, route to a human reviewer, or do not build — for a specific reasoning task against a specific dataset under specific governance constraints.
The task type where we have invested the most evaluation work is operational-data migration: moving customer, service, contract, or asset data from a legacy system into a target system in a way that is correct, traceable, and governance-defensible.
It is the right flagship case because it concentrates almost every property that makes enterprise AI hard. The data is dirty — duplicates, near-duplicates, fake or test records, conflicting service codes, orphan rows, malformed entries that have been live in production for years. The documentation is partial. The business owners cannot answer every question on demand. And the cost of a wrong migration is not abstract: it surfaces months later as customers billed twice, service requests routed to closed accounts, or governance reports that fail review.
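To make these pathologies concrete, here is a minimal sketch of the kind of legacy extract our bench cases mirror. Every identifier, company name, and field below is invented for illustration; no client data is shown.

```python
# Toy legacy-system extract with the pathologies above planted deliberately.
legacy_customers = [
    {"id": "C-1001", "name": "Põhja Teenused OÜ", "service_code": "SVC-14", "account": "A-77"},
    # Near-duplicate of C-1001 with a conflicting service code: neither row
    # can be silently canonicalised without a business ruling.
    {"id": "C-1042", "name": "Pohja Teenused OU", "service_code": "SVC-09", "account": "A-77"},
    # A test record that has been live in production for years.
    {"id": "C-9999", "name": "TEST DO NOT USE", "service_code": "SVC-14", "account": "A-00"},
    # Orphan row: its account does not resolve in the account table.
    {"id": "C-2203", "name": "Lõuna Logistika AS", "service_code": "SVC-31", "account": "A-404"},
    # Malformed entry: a date where a service code should be.
    {"id": "C-3310", "name": "Kesk Kinnisvara", "service_code": "2019-06-01", "account": "A-12"},
]

accounts = {"A-77", "A-12"}  # A-00 and A-404 do not exist in the target system

orphans = [r for r in legacy_customers if r["account"] not in accounts]
print(f"{len(orphans)} of {len(legacy_customers)} rows have unresolvable accounts")
```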
On a public benchmark, a strong reasoning model will produce confident answers to the structured questions a migration raises. Against an unfamiliar real dataset, the same model will frequently produce something that looks identical — and is wrong in ways the business will not detect until production. The migration setting forces evaluation to address that gap directly.
The table below summarises six failure modes we see repeatedly when evaluating reasoning models against operational-data tasks. The "weak output" column describes what conventional evaluation often fails to penalise. The "strong output" column describes what an organisation actually needs before signing off on a build.
| Failure mode | Weak output | Strong output |
|---|---|---|
| Duplicate entities | Returns one canonical record; ignores near-duplicates with conflicting fields. | Surfaces the duplicates, names the conflict, and proposes a reconciliation rule the business can approve. |
| Fake or test records | Treats all rows as real; produces statistics inflated by historical test data left in production. | Flags suspected non-production records with a stated heuristic and asks for a human ruling before proceeding. |
| Service-code conflicts | Picks one mapping between conflicting code systems and proceeds confidently. | Reports the conflict explicitly, lists which systems disagree, and refuses to commit a mapping without authority. |
| Orphan records | Returns clean joins; silently drops rows whose foreign keys do not resolve. | Quantifies the orphan population, characterises it, and asks whether to migrate, archive, or reject. |
| ROI uncertainty | Produces a confident projected benefit based on assumptions the model invented. | States the projection's load-bearing assumptions and the conditions under which the conclusion would reverse. |
| Governance constraints | Recommends a model or vendor without considering data-residency, sectoral, or procurement rules. | Names which providers are eligible for this dataset and which are not, with the rule that excludes them. |
A reasoning model that produces a "strong output" answer is harder to score on a leaderboard — because the right answer is often "this question cannot be answered safely from this data". That is precisely the answer enterprise buyers need, and the one most evaluations are not designed to reward.
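The first table row maps to code naturally. Below is a minimal sketch of the strong-output behaviour for duplicate entities, under an assumed name-normalisation rule; the function and field names are ours for illustration, not a prescribed implementation.

```python
import unicodedata

def norm(name: str) -> str:
    # Crude normalisation so "Põhja Teenused OÜ" and "Pohja Teenused OU"
    # group together; a real pipeline would use a vetted matching rule.
    return unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode().lower()

def strong_merge(records, key_fields=("service_code",)):
    """Group candidate duplicates; where grouped rows disagree on a key
    field, emit a named conflict for human approval instead of silently
    committing a canonical record (the weak-output behaviour)."""
    groups: dict[str, list[dict]] = {}
    for r in records:
        groups.setdefault(norm(r["name"]), []).append(r)
    canonical, conflicts = [], []
    for name, rows in groups.items():
        disputed = [f for f in key_fields if len({row[f] for row in rows}) > 1]
        if disputed:
            conflicts.append({"entity": name,
                              "rows": [r["id"] for r in rows],
                              "disputed_fields": disputed})
        else:
            canonical.append(rows[0])
    return canonical, conflicts

rows = [
    {"id": "C-1001", "name": "Põhja Teenused OÜ", "service_code": "SVC-14"},
    {"id": "C-1042", "name": "Pohja Teenused OU", "service_code": "SVC-09"},
]
canonical, conflicts = strong_merge(rows)
print(conflicts)
# [{'entity': 'pohja teenused ou', 'rows': ['C-1001', 'C-1042'],
#   'disputed_fields': ['service_code']}]
```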
Each evaluation run pairs a reasoning task with a dataset designed to mirror the structural failure modes above. The model under test produces a response; the response is then assessed along three axes — traceability, calibrated uncertainty, and decision usefulness — and supported by deterministic validators, fatal-error rules, and human scorecards.
Each non-trivial claim in the response should be tied to a specific input the model was given. If the model asserts that two rows refer to the same entity, the response should name the fields that support the assertion. If the model recommends migrating a record, the recommendation should reference the rule applied. Responses that produce conclusions without citable inputs are scored low even when the conclusion happens to be correct.
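A minimal sketch of how such a traceability check can be represented follows; the Claim structure, field names, and scoring rule are assumptions of ours, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """One non-trivial assertion in a model response, scored on whether it
    cites the specific inputs and rule that support it."""
    text: str                        # e.g. "C-1001 and C-1042 are the same entity"
    cited_inputs: list[str] = field(default_factory=list)  # e.g. ["C-1001.account"]
    rule_applied: str | None = None  # e.g. "merge-on-matching-account-and-name"

def traceability_score(claims: list[Claim]) -> float:
    """Fraction of claims backed by at least one citable input. A correct
    conclusion with no citable input still scores as untraceable."""
    if not claims:
        return 0.0
    return sum(bool(c.cited_inputs) for c in claims) / len(claims)
```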
A response that says "I am confident" about something the input data does not support is scored worse than a response that says "I cannot determine this from the data given" about the same question. Calibration is measured against a reference set where the right answer is known to be unknowable from the inputs.
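The sketch below shows one way such a calibration check can be scored; the question ids, response vocabulary, and scoring rule are illustrative assumptions.

```python
# Reference set: some questions are planted so that the inputs cannot
# determine an answer, and the only calibrated response is an abstention.
reference = {
    "q1": {"answerable": True},
    "q2": {"answerable": False},  # unknowable from the inputs by construction
}

def calibration_score(responses: dict[str, str]) -> float:
    """responses maps question id to "answer" or "abstain". Confident
    answers to unanswerable questions are the costly miscalibration."""
    correct = sum(
        (responses.get(qid) == "abstain") == (not meta["answerable"])
        for qid, meta in reference.items()
    )
    return correct / len(reference)

print(calibration_score({"q1": "answer", "q2": "abstain"}))  # 1.0
print(calibration_score({"q1": "answer", "q2": "answer"}))   # 0.5: confident but unsupported
```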
The output should be usable by an accountable human for the next step: approve a migration rule, escalate a conflict, request additional data, or reject the task. Outputs that read fluently but do not narrow the human's decision space are scored low. Outputs that decline a question and state precisely what additional input would change the answer are scored high.
Where outputs can be checked programmatically, validators test artifact presence, structural correctness, count consistency, planted fake records, expected entities, review-queue consistency, and reconciliation between output files. Validators do not replace human judgment, but they prevent polished answers from passing when the underlying artifacts are wrong.
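A minimal sketch of what the deterministic layer looks like in practice; the file layouts, field names, and failure strings are our illustrative choices, and the checks shown are a subset, not the full validator set.

```python
import json

def validate_run(migrated_path: str, review_queue_path: str,
                 planted_fakes: set[str], source_count: int) -> list[str]:
    """Deterministic post-run checks over the two output files a case produces."""
    with open(migrated_path) as f:
        migrated_ids = {r["id"] for r in json.load(f)}
    with open(review_queue_path) as f:
        queued_ids = {r["id"] for r in json.load(f)}

    failures = []
    # Planted fake records: any fake that was silently migrated is a failure.
    if migrated_ids & planted_fakes:
        failures.append("fake record canonicalised")
    # Count consistency: every source row must be accounted for somewhere.
    if len(migrated_ids) + len(queued_ids) != source_count:
        failures.append("row count does not reconcile with source")
    # Reconciliation between output files: no record may be both migrated
    # and queued for review.
    if migrated_ids & queued_ids:
        failures.append("record appears in both output files")
    return failures
```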
Some failures override average scores. A migration that silently canonicalises fake records, erases a service-code conflict, drops orphan records, or produces reconciliation mismatches is unsafe regardless of how fluent the report looks. Fatal-error rules identify outputs that cannot be recommended for implementation, even when other dimensions score well.
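The override logic is simple by design, as the sketch below shows; the rule strings match the validator sketch above, and the 0.7 threshold is an arbitrary placeholder, not a calibrated cut-off.

```python
# Illustrative fatal-error rules: any one of these overrides the averaged
# dimension scores, whatever they are.
FATAL = {
    "fake record canonicalised",
    "service-code conflict erased",
    "orphan rows silently dropped",
    "record appears in both output files",
}

def final_verdict(dimension_scores: dict[str, float], failures: list[str]) -> str:
    if FATAL & set(failures):
        return "do-not-implement"  # fluency cannot buy back an unsafe artifact
    mean = sum(dimension_scores.values()) / len(dimension_scores)
    return "recommendable" if mean >= 0.7 else "needs-human-review"
```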
Where judgment is required, structured human scorecards assess risk posture, source discipline, implementation realism, executive usability, and routing recommendation. Scoring is performed against a fixed rubric and documented per run. Where reasonable people would disagree on a score, the disagreement is recorded as part of the result rather than averaged away.
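One way to keep that disagreement visible in the record is sketched below; the 1-5 scale and the one-step dispute rule are assumptions for illustration, while the dimension names come from the rubric above.

```python
from dataclasses import dataclass

DIMENSIONS = ("risk_posture", "source_discipline", "implementation_realism",
              "executive_usability", "routing_recommendation")

@dataclass
class ScorecardEntry:
    dimension: str
    scores: list[int]   # one score per reviewer, against the fixed rubric
    notes: str = ""

    @property
    def disputed(self) -> bool:
        # Disagreement wider than one rubric step is recorded as part of
        # the result rather than averaged away.
        return max(self.scores) - min(self.scores) > 1

entry = ScorecardEntry("risk_posture", scores=[4, 2],
                       notes="reviewer B flags an overconfident merge rule")
print(entry.disputed)  # True: the disagreement is preserved, not averaged
```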
Gemini 3.1 Pro high-thinking is the current baseline. If Gemini 3 Deep Think API access is granted, AI Arenduskeskus will rerun the same cases against Deep Think under the same methodology and compare the two directly.
Many of the failure modes above share a common shape: the safe answer requires holding several conflicting hypotheses simultaneously, evaluating each against the available evidence, and reporting the conflict — rather than collapsing to a single confident answer too early.
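A toy rendering of that shape, with all content invented for illustration: two live hypotheses about one conflicting pair of rows, each held with its supporting and contradicting evidence rather than collapsed to a verdict.

```python
hypotheses = [
    {"claim": "C-1001 and C-1042 are the same customer",
     "supports": ["identical account A-77", "names match after normalisation"],
     "contradicts": ["service codes SVC-14 and SVC-09 disagree"]},
    {"claim": "C-1042 is a re-registration under a new service",
     "supports": ["the service codes disagree"],
     "contradicts": ["no registration event in the audit extract"]},
]

# Both hypotheses retain contradicting evidence, so the safe output reports
# the conflict and routes it to review instead of committing either claim.
unresolved = [h["claim"] for h in hypotheses if h["contradicts"]]
assert len(unresolved) == 2
```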
We are interested in evaluating advanced reasoning models, including ones offering deeper multi-hypothesis reasoning, against this specific shape of task. The hypothesis we want to test is whether additional reasoning depth produces measurably better behaviour on our private bench — particularly on the failure modes where current models tend to converge prematurely. We are not asserting that Deep Think will outperform the baseline; we are testing whether it does.
The question is not whether Deep Think writes a better report. The question is whether Deep Think improves final-stage reasoning on ambiguous enterprise engineering decisions: what is safe to canonicalise, what must remain unresolved, what must go to human review, and when the correct answer is not to build.
The result of that evaluation is itself a useful artifact for enterprise decision-making: it tells a buyer whether a given reasoning capability is implementation-ready for their problem class, or whether the problem currently belongs in a "do not automate" envelope. We expect both outcomes to occur. Evaluations that only ever produce "go" recommendations are not evaluations.
If selected for the EAP, feedback to the Gemini team would be structured around hidden-trap detection, ambiguity handling, provenance discipline, implementation feasibility, latency and token-cost trade-offs, and remaining failure modes.
AI Arenduskeskus operates a two-layer separation between what is published and what is held internally: the methodology and scoring approach are public, while the bench contents, datasets, and per-run results are not.
Client work is treated as a third, stricter layer. No client-identifying data enters the bench, and no bench contents are shared with clients. Where evaluation work informs a real client engagement, the connection is described at the level of pattern, not record.
AI Arenduskeskus operates in the European Union and designs its methodology with GDPR and EU AI Act constraints in mind. Where an evaluation surfaces a question that crosses into legal, sectoral, or compliance territory, the response notes that a governance review is required and identifies the specific question — rather than attempting to settle it.
This methodology is deliberately narrow, and the narrowness is not a disclaimer; it is part of the design. Buyers and partners should expect general model-quality benchmarking, generic advisory work, and the settling of legal, sectoral, or compliance questions to be out of scope: those questions are identified and routed to governance review rather than answered here.
AI Arenduskeskus OÜ is a registered Estonian company operating in the European Union and the broader EMEA region, with a working footprint across Tartu, Tallinn, and Riga. Our work is applied AI evaluation: assessing specific reasoning tasks against specific datasets, in service of specific enterprise implementation decisions.
The methodology described on this page supports AI Arenduskeskus's commercial Discovery Sprint offering for European business clients. Practical information about that offering is published in Estonian on our home page, where European buyers can also reach us directly.