Closing the Honesty Deficit: A Guide to Legal AI Risk Management

By Colin Levy

February 26, 2026

Colin Levy leads the legal function as General Counsel and Evangelist of Malbek, a leading CLM provider. Levy also advises startups and invests in emerging technologies that propel the industry forward. He has authored “The Legal Tech Ecosystem” and “CLM for Dummies” and contributes regularly to many publications. He can be reached at colin.levy@malbek.io.

Legal AI has an honesty deficit. The technology performs certain tasks faster and more consistently than any human lawyer. It performs other tasks so unreliably that deploying it without guardrails creates malpractice exposure. The problem is that vendors, executives, and most legal professionals cannot tell you which tasks fall into which category.

This matters because the confusion is not resolving itself. The general counsel who watched a contract review demo now expects review cycle times to drop dramatically. The CFO who read that AI can automate due diligence wants to know why the last acquisition still required a full team of contract attorneys for three weeks. The business unit leader who used a large language model to draft a vendor agreement does not understand why legal rejected it and rewrote it from scratch. Each of these people encountered AI doing something real. None of them saw the boundary where capability ends and legal AI risk management begins. 

Legal professionals who cannot draw that boundary clearly will lose control of AI adoption in their organizations. Business units will deploy tools without legal oversight. Finance will set savings targets based on vendor marketing rather than operational reality. And when something goes wrong, legal will be asked why it did not prevent the problem it was never resourced to manage.

Where AI actually works: the mechanical layer

AI delivers consistent value on tasks that are repetitive, measurable against predefined criteria, and forgiving enough that verification catches errors before consequences materialize. These tasks are not trivial. They consume a disproportionate share of lawyer time. But they are fundamentally pattern-matching problems, and pattern matching is what current AI does well.

Playbook-based contract review is the most mature use case. Contract intelligence platforms can compare language against a defined playbook, flag deviations from standard positions, extract key terms into structured data, and identify missing clauses. The technology does not get fatigued at hour 14 of a due diligence sprint. It does not skip a contract because the deal team said it was standard. It applies the same criteria to every agreement in a portfolio with a level of consistency that no human team can match across high-volume work.

But these systems are better at the mechanical component of contract review than at contract review itself. An AI tool that flags a non-standard indemnification clause has no idea whether that deviation was a negotiating concession the client agreed to last week or a drafting error that needs correction. It cannot tell you whether the deviation matters given the commercial context of the deal. That distinction is where AI stops being useful and starts being dangerous, and it is a distinction the technology is structurally incapable of making.

E-discovery triage is similarly well-suited. Technology-assisted review has been accepted by courts for over a decade. The workflow that works is AI-first: the model classifies, clusters, and prioritizes the document population, and human reviewers focus on materials requiring judgment rather than reviewing everything sequentially. This is not AI replacing review. It is AI restructuring the reviewer’s workload so that human attention goes where it matters most.

Portfolio-level compliance screening leverages AI’s consistency advantage. When a legal team needs to know which of its vendor agreements contain a specific force majeure formulation or which contracts lack a data processing addendum required by a new regulation, AI can surface those answers in hours rather than weeks. The value is not in reading any single contract. It is in reading all of them and telling you which ones need human attention.

First-draft generation works within a narrow band. Non-disclosure agreements (NDAs), standard vendor terms, employment offer letters, and similar documents with well-defined parameters benefit from AI drafting. The moment the agreement requires customization or integration of deal-specific context, the quality drops sharply enough that fixing the output often costs more time than drafting from scratch.

Actions to take next:

  • Map every legal task currently performed by the team against the three criteria: repetitive, measurable against predefined standards, and low individual-error cost. Tasks that meet all three are candidates for AI assistance. Tasks that meet none should stay fully human.
  • Run a time audit on two or three high-volume workflows to identify what percentage of lawyer time is spent on mechanical steps versus judgment steps. The mechanical percentage is the realistic ceiling for AI-driven efficiency gains.
  • Test your current AI tools against your own playbook, not the vendor’s demo data, and document where extraction accuracy degrades. Those gaps define where human verification must remain intensive.
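The time-audit step above can be reduced to simple arithmetic: the mechanical share of a workflow is the ceiling on AI-driven gains. A minimal sketch, with purely illustrative numbers (the 90/60 split is a hypothetical, not a benchmark):

```python
# Illustrative time-audit math: the mechanical share of total lawyer time
# in a workflow is the realistic ceiling on AI-driven efficiency gains.

def efficiency_ceiling(mechanical_minutes: float, judgment_minutes: float) -> float:
    """Return the mechanical share of total time, as a fraction of 1.0."""
    total = mechanical_minutes + judgment_minutes
    if total == 0:
        raise ValueError("audit recorded no time")
    return mechanical_minutes / total

# Hypothetical audit result: 90 minutes of extraction/comparison work and
# 60 minutes of judgment work per agreement reviewed.
ceiling = efficiency_ceiling(90, 60)
print(f"Realistic AI efficiency ceiling: {ceiling:.0%}")  # 60%
```

If the audit shows judgment steps dominating, even a flawless AI tool cannot deliver the headline savings a vendor projects for the workflow as a whole.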

Where AI fails: the judgment layer

The tasks where AI fails are not harder versions of the tasks where it succeeds. They are structurally different. They require reasoning about situations the model has not encountered, understanding how provisions interact across an agreement, and making decisions that depend on information the model cannot access: the client’s risk appetite, the counterparty’s negotiation history, the regulatory trajectory in a specific jurisdiction.

Novel legal reasoning is the clearest failure point. Legal practice runs on analogy: determining whether a new situation is sufficiently similar to a precedent that the same rule should apply. This is not a similarity-matching exercise. It is an evaluative judgment about which similarities are legally relevant, and that evaluation shifts as courts reinterpret doctrine and legislatures amend statutes. Current models can retrieve cases with overlapping fact patterns. They cannot determine whether the overlap matters in a way that would persuade a judge or inform responsible client advice.

Contextual risk assessment exposes the gap between extraction and understanding. An AI tool reviewing a limitation of liability clause can tell you the cap is set at the value of the contract over the preceding 12 months. It cannot tell you that this cap is appropriate for a low-value software subscription and wildly inadequate for a retainer on a mission-critical infrastructure service where a failure could produce orders-of-magnitude greater downstream loss. The same clause, the same language, the same extraction result, entirely different risk profiles. The lawyer knows this because the lawyer understands the business context. The model does not. No amount of prompt engineering fixes a fundamental absence of situational awareness.

Bespoke agreement drafting reveals the limits even in AI’s strongest domain. The same benchmark evaluations that show AI outperforming humans on mechanical drafting consistently show humans excelling at interpreting client intent, avoiding unnecessary concessions, and integrating multiple information sources into a coherent risk allocation. A joint venture agreement or a technology licensing deal with layered exclusivity provisions requires a drafter who understands what the client is trying to achieve and why the counterparty will resist specific terms. AI generates language. Lawyers generate outcomes.

Privilege analysis cannot be automated. Determining whether a document is protected by attorney-client privilege requires understanding the purpose of the communication, the roles of the people involved, and the circumstances under which it was created. Courts have consistently held that privilege is a legal determination requiring human judgment, not a classification problem a model can solve. Any workflow that relies on AI for privilege screening without robust human review at every decision point is building in risk that will eventually surface at the worst possible time.

Actions to take next:

  • Identify three to five recent matters where the outcome depended on judgment that AI could not have provided. Use these as concrete examples when explaining limitations to stakeholders who equate contract review demos with full legal automation.
  • Establish a written policy specifying which task categories require human decision-making regardless of AI involvement. Privilege determinations, conflict checks, and risk assessments above a defined threshold should be on this list.
  • Review any AI-assisted workflow that currently lacks a defined handoff point between model output and human review. If the handoff is informal, formalize it before an error forces the issue.

The hallucination problem is structural, not transitional

The most dangerous feature of current legal AI is not what it cannot do. It is what it does wrong while appearing authoritative. Large language models generate text that reads as confident and well-sourced regardless of whether the underlying content is accurate. In legal work, this manifests as fabricated case citations, real citations paired with arguments the cited case does not support, and quotations attributed to judicial opinions that never contained the quoted language.

These are not rare edge cases. Independent evaluations of the major legal AI research platforms have consistently found hallucination rates high enough that any unverified AI-assisted research submission carries meaningful risk. The flagship products from the largest legal information providers, the platforms most lawyers trust, produce inaccurate or incomplete responses at rates that would be unacceptable in any other context where professional liability attaches to the output.

The hallucinations are dangerous precisely because they are plausible. A fabricated citation follows the correct format. A hallucinated holding sounds like something the court might have said. Catching these errors requires a lawyer who already knows the relevant law well enough to recognize that the AI’s output does not match reality. This inverts the value proposition: the lawyers best positioned to verify AI research are the ones who least need AI to do the research in the first place.

The financial and professional consequences of unverified AI work product are escalating. Courts have imposed sanctions on attorneys who submitted AI-generated filings without independent verification. Malpractice insurers are increasingly excluding AI-related claims from standard coverage. A legal department using AI research tools without rigorous verification protocols is accumulating exposure that compounds with every unverified output.

Treating hallucination as a temporary problem that better models will resolve is a strategic error. The same architectural features that make large language models effective at summarization and extraction, their ability to generate fluent, contextually appropriate text, are the features that produce hallucinations. The model does not distinguish between generating a correct citation and generating a plausible one. Improvement will be incremental, not categorical. Legal departments that wait for the problem to resolve itself will accumulate risk in the interim.

Actions to take next:

  • Test your AI research tools against a set of known queries where you can verify citations independently. Document the hallucination rate on your own workload, not the vendor’s benchmark dataset.
  • Build verification requirements into workflow documentation so that AI-assisted research is never submitted without independent citation checking. Make this a process requirement, not a suggestion.
  • Check your professional liability coverage for AI-related exclusions. If your insurer excludes AI claims, you need either a policy rider or a verification protocol rigorous enough to satisfy their risk assessment.
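The first action above, documenting a hallucination rate on your own workload, amounts to keeping a verification log and counting outcomes. A minimal sketch, where the records and the three status categories are assumptions for illustration rather than a standard taxonomy:

```python
# Illustrative hallucination-rate measurement: each record is one citation
# the AI tool produced, tagged with the outcome of independent verification.
from collections import Counter

verification_log = [
    {"citation": "Case A", "status": "accurate"},
    {"citation": "Case B", "status": "fabricated"},      # citation does not exist
    {"citation": "Case C", "status": "misattributed"},   # real case, wrong holding
    {"citation": "Case D", "status": "accurate"},
]

counts = Counter(record["status"] for record in verification_log)
errors = len(verification_log) - counts["accurate"]
rate = errors / len(verification_log)
print(f"Hallucination rate on sampled queries: {rate:.0%}")  # 50%
```

The point of the exercise is not the arithmetic but the discipline: a rate measured on your own queries, with your own lawyers doing the verification, is the only number that should inform your verification protocol.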

The gap between vendor promises and deployment

Legal AI spending is growing rapidly. The numbers measure investment, not impact. The majority of law firms have integrated AI tools into their workflows, but most use them for pattern recognition tasks: document review, clause extraction, basic research queries. A small minority have formal AI governance policies. An even smaller minority are measuring return on investment at all, which means most organizations with AI tools cannot tell you whether those tools are working.

The productivity gains that do exist are real but narrow. Meaningful efficiency improvements are documented for specific mechanical tasks. These numbers describe the best-case scenario for the tasks best suited to AI, not the average outcome across a legal department’s full workload. The difference matters because executive expectations are typically set by the best-case numbers while operational reality reflects the average.

Most vendor evaluation processes are not designed to surface this gap. They evaluate demos and reference calls. They do not evaluate accuracy on the buyer’s own documents, failure modes under the buyer’s specific conditions, or the total cost of implementation including workflow redesign, training, verification protocols, and ongoing oversight. A tool with a manageable license fee that requires multiples of that fee in staff time to implement and maintain is not delivering the return the business case projected.

The shadow AI problem makes this worse. Lawyers across most legal departments are already using AI tools informally, without governance, without verification protocols, and without anyone tracking what is being fed into which platform. Legal operations teams that do not inventory this usage cannot manage the risk it creates. And the risk is not hypothetical: every unvetted AI tool processing client data is a potential confidentiality breach, every unverified AI research output is a potential malpractice event.

Actions to take next:

  • Require every AI vendor under evaluation to run accuracy testing on your organization’s own documents. Any vendor that declines is not confident in their product’s real-world performance.
  • Calculate the fully loaded cost of each AI tool currently deployed, including staff time for implementation, training, workflow management, verification, and error remediation. Compare this to the vendor’s projected ROI.
  • Survey lawyers across the department to identify tools being used informally without oversight. If you do not know what AI tools are in use, you cannot govern them.
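The fully loaded cost comparison in the second action above is straightforward once the staff-time categories are priced. A sketch with placeholder figures (every number below is hypothetical, chosen only to show the structure of the calculation):

```python
# Hypothetical fully loaded cost comparison for one deployed AI tool.
# All dollar figures are placeholders, not benchmarks.
license_fee = 50_000
staff_costs = {               # annualized staff time, in dollars
    "implementation": 20_000,
    "training": 10_000,
    "verification": 35_000,
    "error_remediation": 8_000,
}

fully_loaded = license_fee + sum(staff_costs.values())
vendor_projected_savings = 90_000
net = vendor_projected_savings - fully_loaded

print(f"Fully loaded cost: ${fully_loaded:,}")
print(f"Vendor-projected savings: ${vendor_projected_savings:,}")
print("Underdelivers vs. business case" if net < 0 else "Meets business case")
```

In this illustration the staff time required to run the tool safely exceeds the license fee, which is exactly the pattern the business case evaluation is meant to surface.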

Regulatory requirements are becoming operational constraints

The regulatory environment around AI in legal practice is tightening in ways that create concrete operational requirements, not just abstract compliance obligations.

Technology competence rules increasingly mean that lawyers must understand the AI tools they use, not merely use them. Multiple jurisdictions now mandate continuing legal education credits focused on AI competency. Some require explicit disclosure of AI use in court submissions. These are enforceable obligations that define the baseline for competent practice, and the trend is toward more jurisdictions adopting similar requirements.

Uploading privileged communications to a commercial AI platform that does not contractually guarantee confidentiality and data segregation creates a privilege waiver risk. Some jurisdictions have classified failure to verify AI tool data handling practices as professional misconduct. When the majority of corporate legal departments report unauthorized AI usage without data controls, the combination of uncertain platform confidentiality and widespread shadow AI creates a privilege risk that most legal departments have not formally assessed.

Algorithmic bias has moved from theoretical concern to active litigation. Courts are certifying class actions that treat discrimination through AI as legally equivalent to discrimination by a human decision-maker. Regulators are finalizing rules that require meaningful human oversight, proactive bias testing, and multi-year record retention for automated decision systems. Legal departments using AI for any process that affects individuals, from employment screening to claims adjudication, should evaluate their tools against these emerging standards before a plaintiff’s lawyer does.

Actions to take next:

  • Review AI competence and disclosure requirements for every jurisdiction where your lawyers are licensed. Build these into onboarding and annual compliance training.
  • Conduct a data handling audit of every AI tool in use, including shadow tools used by individual lawyers. Verify that each tool’s practices satisfy your confidentiality and privilege protection requirements.
  • Assess whether any AI-assisted process involves decision-making that affects individuals. If so, evaluate it against the emerging regulatory standards for automated decision systems and the case law holding AI-driven discrimination equivalent to human discrimination.

The legal teams that will use AI effectively over the next several years are the ones that can hold two ideas simultaneously: the technology delivers genuine value on a defined set of mechanical tasks, and it creates genuine risk on everything else. Legal professionals who can specify exactly which tasks belong in each category, with evidence from their own operations rather than vendor marketing, will find themselves directing AI strategy rather than reacting to it. If they do not take the lead, business units, consultants, and technology teams will make those decisions instead, leaving the legal department to deal with the consequences later.
