Building a Contract Review Assistant with Dify + Knowledge Base: How to Handle PDF Documents and Set Retrieval Parameters
In enterprise scenarios, contract review does not mean having AI replace the legal team. AI is better suited as a first layer of capability for contract pre-screening, clause identification, and risk flagging.
This type of scenario is highly suitable for implementation using Dify’s Knowledge and Workflow capabilities. The fundamental work of contract review can typically be broken down into the following steps:
- Read contract text
- Retrieve supporting evidence from templates, policies, and rules
- Output structured risk alerts and revision suggestions
The two most critical issues are usually not “whether the model is powerful enough” but rather:
- How should PDF documents be processed
- How should retrieval parameters be configured for more stable results
This article focuses on these two questions, explaining how to build a practical contract review assistant using Dify + Knowledge Base.
1. First, Clarify: What Role Should the Contract Review Assistant Play?
In enterprise practice, a contract review assistant is better suited as a “pre-screening tool” rather than the final decision-maker.
Tasks it is better suited to handle include:
- Extracting basic contract information such as parties, amounts, terms, and payment conditions
- Locating key clauses such as breach of contract, confidentiality, intellectual property, and auto-renewal
- Conducting preliminary comparisons against enterprise standard templates or legal rules
- Outputting a risk alert checklist
- Generating draft revision suggestions for business or legal teams
The following, however, should typically remain with human judgment:
- Major transaction structure assessments
- Cross-border compliance issues
- Industry regulatory detail interpretation
- Complex disputes that exceed the enterprise’s existing rule framework
Therefore, the goal of a mature contract review assistant is not to “replace the legal team” but to first complete the standardized, repetitive, and rule-based initial review work.
2. Step One: Prepare Two Types of Materials
A contract review assistant typically requires at least two types of input materials.
1. Contracts to Be Reviewed
These are PDFs, Word documents, or other contract texts uploaded by business departments.
2. Review Reference Materials
For example:
- Enterprise standard contract templates
- Legal review checklists
- Risk clause rule tables
- Contract approval policies
- Common issue documentation
- Historical revision guidelines
In actual projects, many teams prioritize uploading the contracts themselves but overlook organizing “review reference materials.” As a result, the system can only produce generic summaries rather than genuinely valuable review conclusions.
3. Step Two: Process PDF Documents
In contract scenarios, PDF is both the most common format and the one most likely to affect retrieval quality.
Common Issues
- Scanned PDFs: text cannot be extracted directly, so OCR is required before any subsequent processing.
- Complex formatting: headers, footers, tables, stamps, and page numbers can interfere with chunking.
- Multiple templates mixed in a single file: this directly impacts the stability of subsequent retrieval results.
- Excessive duplicate clauses: these can crowd out effective content during ranking.
Recommended Processing Approach
Before uploading, it is recommended to perform basic preprocessing on PDFs:
- Prefer PDFs with extractable text
- Apply OCR to scanned documents first
- Remove covers, blank pages, and pages with only signatures/stamps that have no substantive meaning
- Keep one contract per file whenever possible
- If formatting is complex, consider converting to cleaner text or Markdown structure first
If the enterprise has many contract sources, it is recommended to establish a dedicated “contract cleaning process” before formal development – standardize raw files first, then import them into the Dify knowledge system.
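As a rough illustration, the text-extraction decision can be sketched in a few lines of Python. This is a minimal sketch assuming pypdf, pdf2image, and pytesseract are available; the length threshold for detecting scans is a placeholder to tune against your own corpus.

```python
# Minimal sketch: prefer the embedded text layer, fall back to OCR.
# The 100-character threshold is an assumed heuristic, not a standard.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_contract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Almost no extractable text usually means a scanned document.
    if len(text.strip()) < 100:
        images = convert_from_path(pdf_path)
        text = "\n".join(pytesseract.image_to_string(img) for img in images)
    return text
```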
4. Step Three: Separate the Knowledge Layer Clearly
In contract review scenarios, we do not recommend dumping all materials into a single unified knowledge base. Instead, we recommend splitting by role.
Knowledge Base A: Enterprise Contract Templates
- Standard procurement contract
- Standard sales contract
- Service contract template
- NDA template
Knowledge Base B: Review Rules and Red Lines
- Risk clause checklist
- Legal review guidelines
- Approval authority rules
- Exception case documentation
Knowledge Base C: Supplementary Policies and Compliance Materials
- Seal management policy
- Payment approval policy
- Data compliance requirements
- Industry-specific constraints
The value of this approach is that the system can subsequently select a more appropriate knowledge scope based on the contract type, rather than performing undifferentiated retrieval across all materials.
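In practice, this scope selection can be as simple as a lookup table. The sketch below is illustrative only: the contract-type labels and knowledge-base names are placeholders, not real Dify dataset IDs.

```python
# Hypothetical mapping from contract type to knowledge bases A/B/C.
KNOWLEDGE_SCOPES = {
    "procurement": ["kb_templates", "kb_rules"],
    "sales": ["kb_templates", "kb_rules"],
    "nda": ["kb_templates", "kb_rules", "kb_compliance"],
    "service": ["kb_templates", "kb_rules", "kb_compliance"],
}

def select_knowledge_bases(contract_type: str) -> list[str]:
    # Unknown types still get the rules base, so retrieval stays scoped
    # rather than falling back to an undifferentiated search.
    return KNOWLEDGE_SCOPES.get(contract_type, ["kb_rules"])
```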
5. Step Four: Design the Contract Review Workflow
A deployable basic contract review flow can typically be designed as:
Upload contract
→ Extract contract text
→ Determine contract type
→ Retrieve corresponding templates and review rules
→ Extract key clauses
→ Generate risk checklist and recommendations
→ Output structured review results
In Dify, this can typically be split into the following nodes (a skeletal code view follows the list):
- Input: Enter contract text or upload processed text
- LLM Node: Identify contract type
- Knowledge Retrieval: Retrieve templates and rules
- LLM Node: Extract key clauses
- LLM Node: Output risk analysis based on rules
- Answer / JSON Output: Output structured review results
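To make the node sequence concrete, here is a skeletal Python view of the same flow. Every function is a stub standing in for a Dify node; none of these names are Dify APIs, and select_knowledge_bases is the routing sketch from the previous section.

```python
def classify_contract(text: str) -> str:
    """LLM node: return a label such as 'procurement' or 'nda'."""
    raise NotImplementedError

def retrieve(query: str, knowledge_bases: list[str]) -> list[str]:
    """Knowledge Retrieval node: return supporting chunks."""
    raise NotImplementedError

def review_contract(contract_text: str) -> dict:
    contract_type = classify_contract(contract_text)
    evidence = retrieve(contract_text, select_knowledge_bases(contract_type))
    # Downstream LLM nodes would extract key clauses, score risks against
    # the retrieved evidence, and emit the structured result shown below.
    return {"contract_type": contract_type, "evidence": evidence}
```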
Recommended Output Fields
Rather than outputting a natural language summary, we recommend using structured results, such as:
- Contract type
- Contract term
- Payment clause summary
- Auto-renewal clause
- Breach of contract clause
- Potential risk points
- Recommended revisions
- Whether manual review is advised
This structure is more conducive to subsequent integration with approval systems, legal ledgers, or internal reporting processes.
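For reference, one possible shape for such a result is shown below. All field names and values are invented examples, not output from a real review.

```python
# Illustrative structured result; every value here is made up.
example_result = {
    "contract_type": "procurement",
    "contract_term": "2024-01-01 to 2025-12-31",
    "payment_summary": "Net 60 after acceptance",
    "auto_renewal": "Renews annually unless terminated with 90 days' notice",
    "breach_clause": "Liquidated damages capped at 20% of contract value",
    "risk_points": [
        {
            "level": "medium",
            "description": "Payment terms deviate from the Net 30 template",
            "basis": "Standard procurement template, payment clause",
        }
    ],
    "suggested_revisions": ["Align payment terms with the standard template"],
    "needs_manual_review": True,
}
```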
6. Step Five: Properly Understand Retrieval Parameters
In a contract review assistant, retrieval quality determines output quality.
If the system fails to retrieve the correct clauses, templates, or rules, then no matter how polished the subsequent generation steps are, the deviation only grows.
Therefore, during the development phase, focus on the following parameter considerations.
1. Top K Should Not Be Too Small, Nor Blindly Too Large
Top K represents how many relevant chunks the system returns in a single retrieval.
- Too small: Risk missing key supporting evidence
- Too large: Risk introducing excessive noise, affecting model focus
In contract scenarios, clause-locating questions may work with a smaller Top K, while comprehensive review questions typically need more contextual support. Therefore, Top K should not be fixed to a single value but adjusted based on question type.
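One simple way to express this is a lookup keyed by question type, as in the sketch below. The labels and K values are assumptions to be tuned against your own test set, not recommended defaults.

```python
# Hypothetical Top K per question type; tune against real queries.
TOP_K_BY_QUESTION = {
    "clause_lookup": 3,   # e.g. "Find the auto-renewal clause"
    "template_diff": 6,   # e.g. "Compare payment terms with our template"
    "full_review": 10,    # e.g. "Flag all risks in this contract"
}

def choose_top_k(question_type: str) -> int:
    return TOP_K_BY_QUESTION.get(question_type, 5)
```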
2. Do Not Treat “Disabling” as a Long-Term Strategy
In some RAG practices, teams disable repeated or unwanted chunks, assuming this prevents them from affecting results.
However, in many cases duplicate chunks can still influence the ranking process, pushing truly effective content out of the top results. Rather than relying on disabling, a more reliable approach is the following (sketched in code after the list):
- Clean up duplicate clauses before uploading
- Remove outdated versions
- Do not leave cleaning tasks until after retrieval
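A pre-upload dedup pass can be very small. The sketch below drops chunks whose normalized text has already been seen; the whitespace-and-case normalization is deliberately crude and would likely need refinement for real clause variants.

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalize whitespace and case before hashing.
        key = hashlib.sha256(" ".join(chunk.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```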
3. Chunks Should Follow Contract Structure, Not Just Character Count
Contracts are not ordinary prose. If chunking is too fine, clause context will break; if too coarse, retrieval focus will degrade.
A more reasonable approach is typically:
- Chunk by clause or subsection
- Preserve clause headings
- Aim for each chunk to express one complete rule
For example, “Payment Terms” and “Breach of Contract” should not fall into the same large chunk, nor should a complete clause be artificially split into multiple fragmented pieces.
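A clause-aware splitter can start from heading patterns. The regex below is an assumption tuned for English-style numbering ("1.", "2.3", "Article 5", "Section 7"); real contract corpora will need their own patterns.

```python
import re

# Assumed heading pattern; adapt to your contracts' numbering style.
HEADING = re.compile(r"^(?:(?:Article|Section)\s+\d+|\d+(?:\.\d+)*\.?)\s",
                     re.MULTILINE)

def chunk_by_clause(text: str) -> list[str]:
    starts = sorted({0, *(m.start() for m in HEADING.finditer(text))})
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```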
4. Rewrite Ambiguous Questions First
In contract review, common questions tend to be fairly ambiguous, for example:
- “Is there any risk in this clause?”
- “Does this need to be changed?”
- “Is this contract compliant?”
These types of questions do not perform well when sent directly to retrieval. A better approach is to first use a pre-processing LLM node to rewrite the question into a more specific retrieval request, for example:
- “Check whether the payment terms are consistent with the company template”
- “Locate the content in the contract regarding auto-renewal and breach of contract”
This approach can significantly improve relevance.
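The rewrite step can be a dedicated LLM node with a short prompt. Below is one possible wording; call_llm stands in for whatever model client the workflow uses and is not a Dify API.

```python
# Example rewrite prompt; the wording is illustrative.
REWRITE_PROMPT = """Rewrite the user's question into one or more specific \
retrieval queries about concrete contract clauses. Name the clause types \
involved (payment, auto-renewal, breach, confidentiality, ...).

Question: {question}
Retrieval queries:"""

def rewrite_question(question: str, call_llm) -> str:
    return call_llm(REWRITE_PROMPT.format(question=question))
```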
7. Step Six: Prompts Must Emphasize “Evidence” and “Boundaries”
One of the biggest risks in contract review scenarios is the model outputting seemingly reasonable but unsupported judgments when materials are insufficient.
Therefore, it is recommended to state the following constraints explicitly in the answer prompt:
You are a contract review assistant.
Please analyze strictly based on the provided contract content and review rules.
Requirements:
- Do not fabricate non-existent clauses
- Do not present speculation as conclusions
- Cite the basis for each risk point
- If materials are insufficient, clearly indicate that manual review is needed
- Output should be as structured as possible
If the enterprise wants to further enhance readability, risk levels can also be added, such as:
- High risk
- Medium risk
- Low risk
- Insufficient information
8. Step Seven: Build a Small-Scale Test Set
Before going live, we recommend starting with a single contract category for testing, such as:
- NDA
- Standard procurement contract
- Service contract
Then prepare 10 to 20 test samples covering the following situations:
- Standard template versions
- Manually modified versions
- Versions with obvious risk clauses
- Versions with incomplete information or poor OCR quality
During evaluation, focus on the following (a minimal harness sketch follows the list):
- Whether contract type identification is correct
- Whether key clauses can be stably extracted
- Whether risk alerts have supporting evidence
- Whether results are suitable for direct reading by business or legal teams
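Even a tiny harness makes these checks repeatable. In the sketch below, the test cases are invented examples, load_text is a placeholder for your file loader, and review_contract is the pipeline skeleton from Step Four.

```python
# Invented test cases; expand per contract category.
TEST_CASES = [
    {"file": "nda_standard.txt", "expect_type": "nda", "expect_manual": False},
    {"file": "nda_missing_term.txt", "expect_type": "nda", "expect_manual": True},
]

def run_eval(load_text, review_contract) -> None:
    for case in TEST_CASES:
        result = review_contract(load_text(case["file"]))
        print(
            case["file"],
            "type_ok:", result.get("contract_type") == case["expect_type"],
            "manual_ok:", result.get("needs_manual_review") == case["expect_manual"],
        )
```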
9. Recommended Deployment Approach
When promoting this type of project internally, we typically do not recommend naming it “AI Contract Review System” in the first phase.
Naming that is more likely to be accepted by business and legal teams includes:
- Contract pre-screening assistant
- Contract information extraction assistant
- Clause risk alert assistant
- Contract template comparison assistant
This type of naming better reflects the current capability boundaries of AI and helps set reasonable expectations within the organization.
Conclusion
When building a contract review assistant with Dify + Knowledge Base, the factors that truly determine effectiveness are not just the model but whether three foundational areas are solidly addressed:
- Whether PDF documents have been cleaned into high-quality text
- Whether review reference materials have been organized into a clear knowledge layer
- Whether retrieval parameters and chunking strategies have been optimized around contract structure
Once these three things are handled well, Dify can effectively support a practical contract pre-screening workflow: first extract information, then locate clauses, then output risk alerts based on rules, and finally return complex judgments to human review.
This is also the approach that enterprises can most easily put into real use today: not having AI directly replace the legal team, but first making AI an efficient screening layer before legal review.