Building a Contract Review Assistant with Dify + Knowledge Base: How to Handle PDF Documents and Set Retrieval Parameters
In enterprise scenarios, contract review does not mean having AI replace the legal team. AI is better suited as a first layer of capability for contract pre-screening, clause identification, and risk flagging.
This type of scenario is highly suitable for implementation using Dify’s Knowledge and Workflow capabilities. The fundamental work of contract review can typically be broken down into the following steps:
- Read contract text
- Retrieve supporting evidence from templates, policies, and rules
- Output structured risk alerts and revision suggestions
The two most critical issues are usually not “whether the model is powerful enough” but rather:
- How should PDF documents be processed
- How should retrieval parameters be configured for more stable results
This article focuses on these two questions, explaining how to build a practical contract review assistant using Dify + Knowledge Base.
1. First, Clarify: What Role Should the Contract Review Assistant Play?
In enterprise practice, a contract review assistant is better suited as a “pre-screening tool” rather than the final decision-maker.
Tasks it is better suited to handle include:
- Extracting basic contract information such as parties, amounts, terms, and payment conditions
- Locating key clauses such as breach of contract, confidentiality, intellectual property, and auto-renewal
- Conducting preliminary comparisons against enterprise standard templates or legal rules
- Outputting a risk alert checklist
- Generating draft revision suggestions for business or legal teams
The following, however, should typically remain with human judgment:
- Major transaction structure assessments
- Cross-border compliance issues
- Industry regulatory detail interpretation
- Complex disputes that exceed the enterprise’s existing rule framework
Therefore, the goal of a mature contract review assistant is not to “replace the legal team” but to first complete the standardized, repetitive, and rule-based initial review work.
2. Step One: Prepare Two Types of Materials
A contract review assistant typically requires at least two types of input materials.
1. Contracts to Be Reviewed
These are PDFs, Word documents, or other contract texts uploaded by business departments.
2. Review Reference Materials
For example:
- Enterprise standard contract templates
- Legal review checklists
- Risk clause rule tables
- Contract approval policies
- Common issue documentation
- Historical revision guidelines
In actual projects, many teams prioritize uploading the contracts themselves but overlook organizing “review reference materials.” As a result, the system can only produce generic summaries rather than genuinely valuable review conclusions.
3. Step Two: Process PDF Documents
In contract scenarios, PDF is both the most common format and the one most likely to affect retrieval quality.
Common Issues
- Scanned PDFs: text cannot be extracted directly, so OCR is required before any subsequent processing.
- Complex formatting: headers, footers, tables, stamps, and page numbers can interfere with chunking.
- Multiple templates mixed in a single file: this directly impacts the stability of subsequent retrieval results.
- Excessive duplicate clauses: these can crowd out effective content during ranking.
Recommended Processing Approach
Before uploading, it is recommended to perform basic preprocessing on PDFs:
- Prefer PDFs with extractable text
- Apply OCR to scanned documents first
- Remove covers, blank pages, and pages with only signatures/stamps that have no substantive meaning
- Keep one contract per file whenever possible
- If formatting is complex, consider converting to cleaner text or Markdown structure first
If the enterprise has many contract sources, it is recommended to establish a dedicated “contract cleaning process” before formal development – standardize raw files first, then import them into the Dify knowledge system.
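As a rough illustration, the text-extraction decision can be sketched in a few lines of Python. This is a minimal sketch assuming pypdf, pdf2image, and pytesseract are available; the length threshold for detecting scans is a placeholder to tune against your own corpus.

```python
# Minimal sketch: prefer the embedded text layer, fall back to OCR.
# The 100-character threshold is an assumed heuristic, not a standard.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_contract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Almost no extractable text usually means a scanned document.
    if len(text.strip()) < 100:
        images = convert_from_path(pdf_path)
        text = "\n".join(pytesseract.image_to_string(img) for img in images)
    return text
```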
4. Step Three: Separate the Knowledge Layer Clearly
In contract review scenarios, we do not recommend dumping all materials into a single unified knowledge base. Instead, we recommend splitting by role.
Knowledge Base A: Enterprise Contract Templates
- Standard procurement contract
- Standard sales contract
- Service contract template
- NDA template
Knowledge Base B: Review Rules and Red Lines
- Risk clause checklist
- Legal review guidelines
- Approval authority rules
- Exception case documentation
Knowledge Base C: Supplementary Policies and Compliance Materials
- Seal management policy
- Payment approval policy
- Data compliance requirements
- Industry-specific constraints
The value of this approach is that the system can subsequently select a more appropriate knowledge scope based on the contract type, rather than performing undifferentiated retrieval across all materials.
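In practice, this scope selection can be as simple as a lookup table. The sketch below is illustrative only: the contract-type labels and knowledge-base names are placeholders, not real Dify dataset IDs.

```python
# Hypothetical mapping from contract type to knowledge bases A/B/C.
KNOWLEDGE_SCOPES = {
    "procurement": ["kb_templates", "kb_rules"],
    "sales": ["kb_templates", "kb_rules"],
    "nda": ["kb_templates", "kb_rules", "kb_compliance"],
    "service": ["kb_templates", "kb_rules", "kb_compliance"],
}

def select_knowledge_bases(contract_type: str) -> list[str]:
    # Unknown types still get the rules base, so retrieval stays scoped
    # rather than falling back to an undifferentiated search.
    return KNOWLEDGE_SCOPES.get(contract_type, ["kb_rules"])
```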
5. Step Four: Design the Contract Review Workflow
A deployable basic contract review flow can typically be designed as:
Upload contract
→ Extract contract text
→ Determine contract type
→ Retrieve corresponding templates and review rules
→ Extract key clauses
→ Generate risk checklist and recommendations
→ Output structured review results
In Dify, this can typically be split into the following nodes (a skeletal code view follows the list):
- Input: Enter contract text or upload processed text
- LLM Node: Identify contract type
- Knowledge Retrieval: Retrieve templates and rules
- LLM Node: Extract key clauses
- LLM Node: Output risk analysis based on rules
- Answer / JSON Output: Output structured review results
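To make the node sequence concrete, here is a skeletal Python view of the same flow. Every function is a stub standing in for a Dify node; none of these names are Dify APIs, and select_knowledge_bases is the routing sketch from the previous section.

```python
def classify_contract(text: str) -> str:
    """LLM node: return a label such as 'procurement' or 'nda'."""
    raise NotImplementedError

def retrieve(query: str, knowledge_bases: list[str]) -> list[str]:
    """Knowledge Retrieval node: return supporting chunks."""
    raise NotImplementedError

def review_contract(contract_text: str) -> dict:
    contract_type = classify_contract(contract_text)
    evidence = retrieve(contract_text, select_knowledge_bases(contract_type))
    # Downstream LLM nodes would extract key clauses, score risks against
    # the retrieved evidence, and emit the structured result shown below.
    return {"contract_type": contract_type, "evidence": evidence}
```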
Recommended Output Fields
Rather than outputting a natural language summary, we recommend using structured results, such as:
- Contract type
- Contract term
- Payment clause summary
- Auto-renewal clause
- Breach of contract clause
- Potential risk points
- Recommended revisions
- Whether manual review is advised
This structure is more conducive to subsequent integration with approval systems, legal ledgers, or internal reporting processes.
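For reference, one possible shape for such a result is shown below. All field names and values are invented examples, not output from a real review.

```python
# Illustrative structured result; every value here is made up.
example_result = {
    "contract_type": "procurement",
    "contract_term": "2024-01-01 to 2025-12-31",
    "payment_summary": "Net 60 after acceptance",
    "auto_renewal": "Renews annually unless terminated with 90 days' notice",
    "breach_clause": "Liquidated damages capped at 20% of contract value",
    "risk_points": [
        {
            "level": "medium",
            "description": "Payment terms deviate from the Net 30 template",
            "basis": "Standard procurement template, payment clause",
        }
    ],
    "suggested_revisions": ["Align payment terms with the standard template"],
    "needs_manual_review": True,
}
```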
6. Step Five: Properly Understand Retrieval Parameters
In a contract review assistant, retrieval quality determines output quality.
If the system fails to retrieve the correct clauses, templates, or rules, then no matter how polished the subsequent generation steps are, the deviation only grows.
Therefore, during the development phase, focus on the following parameter considerations.
1. Top K Should Not Be Too Small, Nor Blindly Too Large
Top K represents how many relevant chunks the system returns in a single retrieval.
- Too small: Risk missing key supporting evidence
- Too large: Risk introducing excessive noise, affecting model focus
In contract scenarios, clause-locating questions may work with a smaller Top K, while comprehensive review questions typically need more contextual support. Therefore, Top K should not be fixed to a single value but adjusted based on question type.
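One simple way to express this is a lookup keyed by question type, as in the sketch below. The labels and K values are assumptions to be tuned against your own test set, not recommended defaults.

```python
# Hypothetical Top K per question type; tune against real queries.
TOP_K_BY_QUESTION = {
    "clause_lookup": 3,   # e.g. "Find the auto-renewal clause"
    "template_diff": 6,   # e.g. "Compare payment terms with our template"
    "full_review": 10,    # e.g. "Flag all risks in this contract"
}

def choose_top_k(question_type: str) -> int:
    return TOP_K_BY_QUESTION.get(question_type, 5)
```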
2. Do Not Treat “Disabling” as a Long-Term Strategy
In some RAG practices, teams disable repeated or unwanted chunks, assuming this prevents them from affecting results.
However, in many cases duplicate chunks can still influence the ranking process, pushing truly effective content out of the top results. Rather than relying on disabling, a more reliable approach is the following (sketched in code after the list):
- Clean up duplicate clauses before uploading
- Remove outdated versions
- Do not leave cleaning tasks until after retrieval
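A pre-upload dedup pass can be very small. The sketch below drops chunks whose normalized text has already been seen; the whitespace-and-case normalization is deliberately crude and would likely need refinement for real clause variants.

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalize whitespace and case before hashing.
        key = hashlib.sha256(" ".join(chunk.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```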
3. Chunks Should Follow Contract Structure, Not Just Character Count
Contracts are not ordinary prose. If chunking is too fine, clause context will break; if too coarse, retrieval focus will degrade.
A more reasonable approach is typically:
- Chunk by clause or subsection
- Preserve clause headings
- Aim for each chunk to express one complete rule
For example, “Payment Terms” and “Breach of Contract” should not fall into the same large chunk, nor should a complete clause be artificially split into multiple fragmented pieces.
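A clause-aware splitter can start from heading patterns. The regex below is an assumption tuned for English-style numbering ("1.", "2.3", "Article 5", "Section 7"); real contract corpora will need their own patterns.

```python
import re

# Assumed heading pattern; adapt to your contracts' numbering style.
HEADING = re.compile(r"^(?:(?:Article|Section)\s+\d+|\d+(?:\.\d+)*\.?)\s",
                     re.MULTILINE)

def chunk_by_clause(text: str) -> list[str]:
    starts = sorted({0, *(m.start() for m in HEADING.finditer(text))})
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```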
4. Rewrite Ambiguous Questions First
In contract review, common questions tend to be fairly ambiguous, for example:
- “Is there any risk in this clause?”
- “Does this need to be changed?”
- “Is this contract compliant?”
These types of questions do not perform well when sent directly to retrieval. A better approach is to first use a pre-processing LLM node to rewrite the question into a more specific retrieval request, for example:
- “Check whether the payment terms are consistent with the company template”
- “Locate the content in the contract regarding auto-renewal and breach of contract”
This approach can significantly improve relevance.
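The rewrite step can be a dedicated LLM node with a short prompt. Below is one possible wording; call_llm stands in for whatever model client the workflow uses and is not a Dify API.

```python
# Example rewrite prompt; the wording is illustrative.
REWRITE_PROMPT = """Rewrite the user's question into one or more specific \
retrieval queries about concrete contract clauses. Name the clause types \
involved (payment, auto-renewal, breach, confidentiality, ...).

Question: {question}
Retrieval queries:"""

def rewrite_question(question: str, call_llm) -> str:
    return call_llm(REWRITE_PROMPT.format(question=question))
```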
7. Step Six: Prompts Must Emphasize “Evidence” and “Boundaries”
One of the biggest risks in contract review scenarios is the model outputting seemingly reasonable but unsupported judgments when materials are insufficient.
Therefore, it is recommended to state the following constraints explicitly in the answer prompt:
You are a contract review assistant.
Please analyze strictly based on the provided contract content and review rules.
Requirements:
- Do not fabricate non-existent clauses
- Do not present speculation as conclusions
- Cite the basis for each risk point
- If materials are insufficient, clearly indicate that manual review is needed
- Output should be as structured as possible
If the enterprise wants to further enhance readability, risk levels can also be added, such as:
- High risk
- Medium risk
- Low risk
- Insufficient information
8. Step Seven: Build a Small-Scale Test Set
Before going live, we recommend starting with a single contract category for testing, such as:
- NDA
- Standard procurement contract
- Service contract
Then prepare 10 to 20 test samples covering the following situations:
- Standard template versions
- Manually modified versions
- Versions with obvious risk clauses
- Versions with incomplete information or poor OCR quality
During evaluation, focus on the following (a minimal harness sketch follows the list):
- Whether contract type identification is correct
- Whether key clauses can be stably extracted
- Whether risk alerts have supporting evidence
- Whether results are suitable for direct reading by business or legal teams
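Even a tiny harness makes these checks repeatable. In the sketch below, the test cases are invented examples, load_text is a placeholder for your file loader, and review_contract is the pipeline skeleton from Step Four.

```python
# Invented test cases; expand per contract category.
TEST_CASES = [
    {"file": "nda_standard.txt", "expect_type": "nda", "expect_manual": False},
    {"file": "nda_missing_term.txt", "expect_type": "nda", "expect_manual": True},
]

def run_eval(load_text, review_contract) -> None:
    for case in TEST_CASES:
        result = review_contract(load_text(case["file"]))
        print(
            case["file"],
            "type_ok:", result.get("contract_type") == case["expect_type"],
            "manual_ok:", result.get("needs_manual_review") == case["expect_manual"],
        )
```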
9. Recommended Deployment Approach
When promoting this type of project internally, we typically do not recommend naming it “AI Contract Review System” in the first phase.
Naming that is more likely to be accepted by business and legal teams includes:
- Contract pre-screening assistant
- Contract information extraction assistant
- Clause risk alert assistant
- Contract template comparison assistant
This type of naming better reflects the current capability boundaries of AI and helps set reasonable expectations within the organization.
Conclusion
When building a contract review assistant with Dify + Knowledge Base, the factors that truly determine effectiveness are not just the model but whether three foundational areas are solidly addressed:
- Whether PDF documents have been cleaned into high-quality text
- Whether review reference materials have been organized into a clear knowledge layer
- Whether retrieval parameters and chunking strategies have been optimized around contract structure
Once these three things are handled well, Dify can effectively support a practical contract pre-screening workflow: first extract information, then locate clauses, then output risk alerts based on rules, and finally return complex judgments to human review.
This is also the approach that enterprises can most easily put into real use today: not having AI directly replace the legal team, but first making AI an efficient screening layer before legal review.