Yaz Hobooti committed on
Commit
e7a28e8
·
0 Parent(s):

Increase PDF resolution: DPI from 300 to 600, scaling factors improved for better OCR and barcode detection

ProofCheck/.dockerignore ADDED
@@ -0,0 +1,65 @@
1
+ # Git
2
+ .git
3
+ .gitignore
4
+ .gitattributes
5
+
6
+ # Python
7
+ __pycache__
8
+ *.pyc
9
+ *.pyo
10
+ *.pyd
11
+ .Python
12
+ env
13
+ pip-log.txt
14
+ pip-delete-this-directory.txt
15
+ .tox
16
+ .coverage
17
+ .coverage.*
18
+ .cache
19
+ nosetests.xml
20
+ coverage.xml
21
+ *.cover
22
+ *.log
23
+ .git
24
+ .mypy_cache
25
+ .pytest_cache
26
+ .hypothesis
27
+
28
+ # Virtual environments
29
+ venv/
30
+ ENV/
31
+ env/
32
+ .venv/
33
+
34
+ # IDE
35
+ .vscode/
36
+ .idea/
37
+ *.swp
38
+ *.swo
39
+ *~
40
+
41
+ # OS
42
+ .DS_Store
43
+ .DS_Store?
44
+ ._*
45
+ .Spotlight-V100
46
+ .Trashes
47
+ ehthumbs.db
48
+ Thumbs.db
49
+
50
+ # Temporary files
51
+ *.tmp
52
+ *.temp
53
+ uploads/
54
+ results/
55
+ static/results/
56
+
57
+ # Documentation
58
+ README.md
59
+ *.md
60
+ docs/
61
+
62
+ # Test files
63
+ test_*.py
64
+ *_test.py
65
+ tests/
ProofCheck/.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
ProofCheck/Dockerfile ADDED
@@ -0,0 +1,46 @@
1
+ FROM python:3.9-slim
2
+
3
+ # Set working directory
4
+ WORKDIR /app
5
+
6
+ # Install system dependencies including Tesseract OCR and zbar
7
+ RUN apt-get update && apt-get install -y \
8
+ tesseract-ocr \
9
+ tesseract-ocr-eng \
10
+ tesseract-ocr-fra \
11
+ poppler-utils \
12
+ libzbar0 \
13
+ libgl1 \
14
+ libglib2.0-0 \
15
+ libsm6 \
16
+ libxext6 \
17
+ libxrender1 \
18
+ libgomp1 \
19
+ libgthread-2.0-0 \
20
+ libfontconfig1 \
21
+ && rm -rf /var/lib/apt/lists/*
22
+
23
+ # Copy requirements first for better caching
24
+ COPY requirements.txt .
25
+
26
+ # Install Python dependencies
27
+ RUN pip install --no-cache-dir -r requirements.txt
28
+
29
+ # Download NLTK data
30
+ RUN python -c "import nltk; nltk.download('punkt')"
31
+
32
+ # Copy application files
33
+ COPY . .
34
+
35
+ # Create necessary directories
36
+ RUN mkdir -p uploads results static/results
37
+
38
+ # Expose port
39
+ EXPOSE 7860
40
+
41
+ # Set environment variables
42
+ ENV PYTHONPATH=/app
43
+ ENV FLASK_APP=app.py
44
+
45
+ # Run the application
46
+ CMD ["python", "app.py"]
ProofCheck/README.md ADDED
@@ -0,0 +1,117 @@
1
+ ---
2
+ title: PDF Comparison Tool
3
+ emoji: 📄
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ pinned: false
8
+ license: mit
9
+ ---
10
+
11
+ # PDF Comparison Tool
12
+
13
+ A comprehensive web-based tool for comparing PDF documents with advanced features including OCR validation, color difference detection, spelling verification, and barcode/QR code detection.
14
+
15
+ ## 🚀 Live Demo
16
+
17
+ This tool is deployed on Hugging Face Spaces and available for immediate use!
18
+
19
+ ## ✨ Features
20
+
21
+ - **PDF Validation**: Ensures uploaded PDFs contain "50 Carroll" using OCR
22
+ - **Color Difference Detection**: Identifies visual differences between PDFs and highlights them with red boxes
23
+ - **Spelling Verification**: Checks text against both English and French dictionaries
24
+ - **Barcode/QR Code Detection**: Automatically detects and reads barcodes and QR codes
25
+ - **Visual Comparison**: Side-by-side comparison with annotated differences
26
+ - **Modern Web Interface**: Responsive design with Bootstrap and custom styling
27
+
28
+ ## 📋 Requirements
29
+
30
+ - Both PDF files must contain the text "50 Carroll" for validation
31
+ - Maximum file size: 16MB per PDF
32
+ - Supported format: PDF only
33
+
34
+ ## 🎯 How to Use
35
+
36
+ 1. **Upload PDFs**: Select two PDF files for comparison
37
+ 2. **Validation**: The tool automatically checks for "50 Carroll" in both documents
38
+ 3. **Processing**: Wait for the analysis to complete (may take a few minutes)
39
+ 4. **Results**: View findings in three organized tabs:
40
+ - **Visual Comparison**: Side-by-side view with red boxes highlighting differences
41
+ - **Spelling Issues**: Table of spelling errors with suggestions from English and French dictionaries
42
+ - **Barcodes & QR Codes**: List of detected barcodes with their data and positions
43
+
44
+ ## 🔧 Technical Details
45
+
46
+ ### Backend Technologies
47
+ - **Python Flask**: Web framework
48
+ - **OpenCV**: Image processing and comparison
49
+ - **Tesseract OCR**: Text extraction from PDFs
50
+ - **scikit-image**: Structural similarity analysis
51
+ - **pyspellchecker**: Spelling verification
52
+ - **pyzbar**: Barcode and QR code detection
53
+
54
+ ### Frontend Technologies
55
+ - **HTML5/CSS3**: Modern responsive design
56
+ - **JavaScript**: Dynamic content and AJAX requests
57
+ - **Bootstrap**: UI framework for professional appearance
58
+
59
+ ### Comparison Algorithms
60
+ - **Color Difference**: Uses the Structural Similarity Index (SSIM) for pixel-level comparison (see the sketch below)
61
+ - **Text Analysis**: OCR-based text extraction with multi-language spell checking
62
+ - **Barcode Detection**: Automatic recognition of various barcode and QR code formats
63
+
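+ A minimal sketch of the color-difference step, assuming both pages have already been rendered to same-size images (illustrative only; the exact thresholds and box handling live in `pdf_comparator.py` and may differ):
+
+ ```python
+ import cv2
+ import numpy as np
+ from skimage.metrics import structural_similarity as ssim
+
+ def highlight_differences(img1_bgr, img2_bgr, threshold=0.85):
+     """Draw red boxes around regions where two same-size page renders differ."""
+     gray1 = cv2.cvtColor(img1_bgr, cv2.COLOR_BGR2GRAY)
+     gray2 = cv2.cvtColor(img2_bgr, cv2.COLOR_BGR2GRAY)
+
+     # full=True returns a per-pixel similarity map alongside the global score
+     score, diff = ssim(gray1, gray2, full=True)
+     diff = (diff * 255).astype("uint8")
+
+     # pixels with low similarity become the "difference" mask
+     _, mask = cv2.threshold(diff, int(threshold * 255), 255, cv2.THRESH_BINARY_INV)
+     contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
+
+     annotated = img1_bgr.copy()
+     for c in contours:
+         x, y, w, h = cv2.boundingRect(c)
+         if w * h > 50:  # ignore tiny speckles
+             cv2.rectangle(annotated, (x, y), (x + w, y + h), (0, 0, 255), 2)  # red in BGR
+     return score, annotated
+ ```
+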
64
+ ## 🛠️ Local Development
65
+
66
+ If you want to run this tool locally:
67
+
68
+ ```bash
69
+ # Clone the repository
70
+ git clone https://huggingface.co/spaces/Digitaljoint/ProofCheck
71
+
72
+ # Install dependencies
73
+ pip install -r requirements.txt
74
+
75
+ # Install Tesseract OCR
76
+ # macOS: brew install tesseract
77
+ # Ubuntu: sudo apt-get install tesseract-ocr
78
+
79
+ # Run the application
80
+ python app.py
81
+ ```
82
+
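+ Once it is running, you can also drive the comparison endpoint directly instead of using the web form. A minimal sketch, assuming the `requests` package is installed (the `/upload` route, field names and response keys come from `app.py`; the PDF file names are placeholders):
+
+ ```python
+ import requests
+
+ # Both PDFs must contain "50 Carroll" to pass validation.
+ with open("proof_v1.pdf", "rb") as f1, open("proof_v2.pdf", "rb") as f2:
+     resp = requests.post(
+         "http://localhost:7860/upload",
+         files={"pdf1": f1, "pdf2": f2},
+     )
+
+ data = resp.json()
+ if resp.ok and data.get("success"):
+     print("Comparison finished, session:", data["session_id"])
+ else:
+     print("Error:", data.get("error"))
+ ```
+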
83
+ ## 📊 Output Examples
84
+
85
+ ### Visual Comparison
86
+ - Red rectangles highlight color differences between PDFs
87
+ - Side-by-side view for easy comparison
88
+ - Page-by-page analysis
89
+
90
+ ### Spelling Issues
91
+ - Word-by-word analysis against English and French dictionaries
92
+ - Spelling suggestions for both languages
93
+ - Organized table format with original text and corrections
94
+
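+ Under the hood this is a two-dictionary check with pyspellchecker: a token is only flagged when neither the English nor the French dictionary knows it. A simplified sketch (the full `check_spelling` logic in `pdf_comparator.py` also normalizes Unicode and applies a domain whitelist):
+
+ ```python
+ from spellchecker import SpellChecker
+
+ en = SpellChecker(language="en")
+ fr = SpellChecker(language="fr")
+
+ def flag_misspellings(words):
+     """Return suggestions for tokens unknown to both dictionaries."""
+     issues = []
+     for word in words:
+         w = word.lower()
+         if en.known([w]) or fr.known([w]):  # known in either language => OK
+             continue
+         issues.append({
+             "word": word,
+             "english_suggestion": en.correction(w),
+             "french_suggestion": fr.correction(w),
+         })
+     return issues
+
+ # "gouttes" is valid French, "botttle" should be flagged
+ print(flag_misspellings(["bottle", "gouttes", "botttle"]))
+ ```
+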
95
+ ### Barcode/QR Code Detection
96
+ - Automatic detection of various barcode formats
97
+ - Extracted data display
98
+ - Position information for each detected code
99
+
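+ Detection is essentially pyzbar run over each rendered page, with extra preprocessing and de-duplication layered on top in `pdf_comparator.py`. A trimmed-down sketch (the PDF path is a placeholder):
+
+ ```python
+ import cv2
+ import numpy as np
+ from pdf2image import convert_from_path
+ from pyzbar.pyzbar import decode
+
+ pages = convert_from_path("label.pdf", dpi=600)
+ for page_num, page in enumerate(pages, start=1):
+     gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
+     for code in decode(gray):
+         # code.data is bytes; code.rect gives left/top/width/height on the page image
+         print(page_num, code.type, code.data.decode("utf-8", errors="ignore"), code.rect)
+ ```
+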
100
+ ## 🔒 Privacy & Security
101
+
102
+ - All processing happens locally on the server
103
+ - No data is stored permanently
104
+ - Files are automatically cleaned up after processing
105
+ - No external API calls or data sharing
106
+
107
+ ## 🤝 Contributing
108
+
109
+ This tool is open source and contributions are welcome! Please feel free to submit issues or pull requests.
110
+
111
+ ## 📄 License
112
+
113
+ This project is available under the MIT License.
114
+
115
+ ---
116
+
117
+ **Note**: This tool is specifically designed to validate PDFs containing "50 Carroll" and will reject files that don't contain this text. This ensures that only relevant documents are processed for comparison.
ProofCheck/app.py ADDED
@@ -0,0 +1,99 @@
1
+ import os
2
+ import uuid
3
+ import json
4
+ from flask import Flask, request, render_template, jsonify, send_file
5
+ from werkzeug.utils import secure_filename
6
+ from pdf_comparator import PDFComparator
7
+ import tempfile
8
+ import shutil
9
+
10
+ app = Flask(__name__)
11
+ app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16MB max file size
12
+ app.config['UPLOAD_FOLDER'] = 'uploads'
13
+ app.config['RESULTS_FOLDER'] = 'results'
14
+
15
+ # Ensure directories exist
16
+ os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
17
+ os.makedirs(app.config['RESULTS_FOLDER'], exist_ok=True)
18
+ os.makedirs('static/results', exist_ok=True)
19
+
20
+ ALLOWED_EXTENSIONS = {'pdf'}
21
+
22
+ def allowed_file(filename):
23
+ return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
24
+
25
+ @app.route('/')
26
+ def index():
27
+ return render_template('index.html')
28
+
29
+ @app.route('/upload', methods=['POST'])
30
+ def upload_files():
31
+ if 'pdf1' not in request.files or 'pdf2' not in request.files:
32
+ return jsonify({'error': 'Both PDF files are required'}), 400
33
+
34
+ pdf1 = request.files['pdf1']
35
+ pdf2 = request.files['pdf2']
36
+
37
+ if pdf1.filename == '' or pdf2.filename == '':
38
+ return jsonify({'error': 'Both PDF files are required'}), 400
39
+
40
+ if not (allowed_file(pdf1.filename) and allowed_file(pdf2.filename)):
41
+ return jsonify({'error': 'Only PDF files are allowed'}), 400
42
+
43
+ # Create unique session directory
44
+ session_id = str(uuid.uuid4())
45
+ session_dir = os.path.join(app.config['UPLOAD_FOLDER'], session_id)
46
+ os.makedirs(session_dir, exist_ok=True)
47
+
48
+ # Save uploaded files
49
+ pdf1_path = os.path.join(session_dir, secure_filename(pdf1.filename))
50
+ pdf2_path = os.path.join(session_dir, secure_filename(pdf2.filename))
51
+
52
+ pdf1.save(pdf1_path)
53
+ pdf2.save(pdf2_path)
54
+
55
+ try:
56
+ # Initialize PDF comparator
57
+ comparator = PDFComparator()
58
+
59
+ # Perform comparison
60
+ results = comparator.compare_pdfs(pdf1_path, pdf2_path, session_id)
61
+
62
+ # Save results
63
+ results_path = os.path.join(app.config['RESULTS_FOLDER'], f'{session_id}_results.json')
64
+ with open(results_path, 'w') as f:
65
+ json.dump(results, f, indent=2)
66
+
67
+ return jsonify({
68
+ 'success': True,
69
+ 'session_id': session_id,
70
+ 'results': results
71
+ })
72
+
73
+ except Exception as e:
74
+ return jsonify({'error': str(e)}), 500
75
+
76
+ @app.route('/results/<session_id>')
77
+ def get_results(session_id):
78
+ results_path = os.path.join(app.config['RESULTS_FOLDER'], f'{session_id}_results.json')
79
+
80
+ if not os.path.exists(results_path):
81
+ return jsonify({'error': 'Results not found'}), 404
82
+
83
+ with open(results_path, 'r') as f:
84
+ results = json.load(f)
85
+
86
+ return jsonify(results)
87
+
88
+ @app.route('/download/<session_id>/<filename>')
89
+ def download_file(session_id, filename):
90
+ file_path = os.path.join(app.config['UPLOAD_FOLDER'], session_id, filename)
91
+
92
+ if not os.path.exists(file_path):
93
+ return jsonify({'error': 'File not found'}), 404
94
+
95
+ return send_file(file_path, as_attachment=True)
96
+
97
+ # For Hugging Face Spaces deployment
98
+ if __name__ == '__main__':
99
+ app.run(debug=True, host='0.0.0.0', port=7860)
ProofCheck/pdf_comparator.py ADDED
@@ -0,0 +1,1938 @@
1
+ import os
2
+ import cv2
3
+ import numpy as np
4
+ from PIL import Image, ImageDraw, ImageFont
5
+ import pytesseract
6
+ from pdf2image import convert_from_path
7
+ from pyzbar.pyzbar import decode
8
+ from spellchecker import SpellChecker
9
+ import nltk
10
+ from skimage.metrics import structural_similarity as ssim
11
+ from skimage import color
12
+ import json
13
+ import tempfile
14
+ import shutil
15
+ import re
16
+ import time
17
+ import signal
18
+ import unicodedata
19
+
20
+ # Safe import for regex with fallback
21
+ try:
22
+ import regex as _re
23
+ _USE_REGEX = True
24
+ except ImportError:
25
+ import re as _re
26
+ _USE_REGEX = False
27
+
28
+ TOKEN_PATTERN = r"(?:\p{L})(?:[\p{L}'-]{1,})" if _USE_REGEX else r"[A-Za-z][A-Za-z'-]{1,}"
29
+
30
+ # Domain whitelist for spell checking
31
+ DOMAIN_WHITELIST = {
32
+ # units / abbreviations
33
+ "mg", "mg/g", "ml", "g", "thc", "cbd", "tcm", "mct",
34
+ # common packaging terms / bilingual words you expect
35
+ "gouttes", "tennir", "net", "zoom", "tytann", "dome", "drops",
36
+ # brand or proper names you want to ignore completely
37
+ "purified", "brands", "tytann", "dome", "drops",
38
+ }
39
+ # lowercase everything in whitelist for comparisons
40
+ DOMAIN_WHITELIST = {w.lower() for w in DOMAIN_WHITELIST}
41
+
42
+ def _likely_french(token: str) -> bool:
43
+ """Helper: quick language guess per token"""
44
+ if _USE_REGEX:
45
+ # any Latin letter outside ASCII => probably FR (é, è, ç…); needs regex V1 set operations
46
+ return bool(_re.search(r"(?V1)[\p{Latin}--[A-Za-z]]", token))
47
+ # fallback: any non-ascii letter
48
+ return any((not ('a' <= c.lower() <= 'z')) and c.isalpha() for c in token)
49
+
50
+ # Try to import additional barcode libraries
51
+ try:
52
+ import zxing
53
+ ZXING_AVAILABLE = True
54
+ except ImportError:
55
+ ZXING_AVAILABLE = False
56
+ print("zxing-cpp not available, using pyzbar only")
57
+
58
+ try:
59
+ from dbr import BarcodeReader
60
+ DBR_AVAILABLE = True
61
+ print("Dynamsoft Barcode Reader available")
62
+ except ImportError:
63
+ DBR_AVAILABLE = False
64
+ print("Dynamsoft Barcode Reader not available")
65
+
66
+ class TimeoutError(Exception):
67
+ pass
68
+
69
+ def timeout_handler(signum, frame):
70
+ raise TimeoutError("Operation timed out")
71
+
72
+ class PDFComparator:
73
+ def __init__(self):
74
+ # Initialize spell checkers for English and French
75
+ self.english_spellchecker = SpellChecker(language='en')
76
+ self.french_spellchecker = SpellChecker(language='fr')
77
+
78
+ # Add domain whitelist words to spell checkers
79
+ for w in DOMAIN_WHITELIST:
80
+ self.english_spellchecker.word_frequency.add(w)
81
+ self.french_spellchecker.word_frequency.add(w)
82
+
83
+ # Download required NLTK data
84
+ try:
85
+ nltk.data.find('tokenizers/punkt')
86
+ except LookupError:
87
+ nltk.download('punkt')
88
+
89
+ def safe_execute(self, func, *args, timeout=30, **kwargs):
90
+ """Execute a function with timeout protection"""
91
+ try:
92
+ # Set timeout signal
93
+ signal.signal(signal.SIGALRM, timeout_handler)
94
+ signal.alarm(timeout)
95
+
96
+ # Execute function
97
+ result = func(*args, **kwargs)
98
+
99
+ # Cancel timeout
100
+ signal.alarm(0)
101
+ return result
102
+
103
+ except TimeoutError:
104
+ print(f"Function {func.__name__} timed out after {timeout} seconds")
105
+ return None
106
+ except Exception as e:
107
+ print(f"Error in {func.__name__}: {str(e)}")
108
+ return None
109
+ finally:
110
+ signal.alarm(0)
111
+
112
+ def validate_pdf(self, pdf_path):
113
+ """Validate that PDF contains '50 Carroll' using enhanced OCR for tiny fonts"""
114
+ try:
115
+ print(f"Validating PDF: {pdf_path}")
116
+
117
+ # Try multiple DPI settings for better tiny font detection
118
+ dpi_settings = [300, 400, 600, 800]
119
+
120
+ for dpi in dpi_settings:
121
+ print(f"Trying DPI {dpi} for tiny font detection...")
122
+
123
+ # Convert PDF to images with current DPI
124
+ images = convert_from_path(pdf_path, dpi=dpi)
125
+ print(f"Converted PDF to {len(images)} images at {dpi} DPI")
126
+
127
+ for page_num, image in enumerate(images):
128
+ print(f"Processing page {page_num + 1} at {dpi} DPI...")
129
+
130
+ # Convert PIL image to OpenCV format
131
+ opencv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
132
+
133
+ # Enhanced preprocessing for tiny fonts
134
+ processed_image = self.enhance_image_for_tiny_fonts(opencv_image)
135
+
136
+ # Try multiple OCR configurations
137
+ ocr_configs = [
138
+ '--oem 3 --psm 6', # Assume uniform block of text
139
+ '--oem 3 --psm 8', # Single word
140
+ '--oem 3 --psm 13', # Raw line
141
+ '--oem 1 --psm 6', # Legacy engine
142
+ '--oem 3 --psm 3', # Fully automatic page segmentation
143
+ ]
144
+
145
+ for config in ocr_configs:
146
+ try:
147
+ # Perform OCR with current configuration
148
+ text = pytesseract.image_to_string(processed_image, config=config)
149
+
150
+ # Debug: Show first 300 characters of extracted text
151
+ debug_text = text[:300].replace('\n', ' ').replace('\r', ' ')
152
+ print(f"Page {page_num + 1} text (DPI {dpi}, config: {config}): '{debug_text}...'")
153
+
154
+ # Check for "50 Carroll" with various patterns
155
+ patterns = ["50 Carroll", "50 carroll", "50Carroll", "50carroll", "50 Carroll", "50 carroll"]
156
+ for pattern in patterns:
157
+ if pattern in text or pattern.lower() in text.lower():
158
+ print(f"Found '{pattern}' in page {page_num + 1} (DPI {dpi}, config: {config})")
159
+ return True
160
+
161
+ except Exception as ocr_error:
162
+ print(f"OCR error with config {config}: {str(ocr_error)}")
163
+ continue
164
+
165
+ print("Validation failed: '50 Carroll' not found in any page with any DPI or OCR config")
166
+ return False
167
+
168
+ except Exception as e:
169
+ print(f"Error validating PDF: {str(e)}")
170
+ raise Exception(f"Error validating PDF: {str(e)}")
171
+
172
+ def enhance_image_for_tiny_fonts(self, image):
173
+ """Enhance image specifically for tiny font OCR"""
174
+ try:
175
+ # Convert to grayscale
176
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
177
+
178
+ # Apply CLAHE (Contrast Limited Adaptive Histogram Equalization)
179
+ clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8,8))
180
+ enhanced = clahe.apply(gray)
181
+
182
+ # Apply bilateral filter to reduce noise while preserving edges
183
+ denoised = cv2.bilateralFilter(enhanced, 9, 75, 75)
184
+
185
+ # Apply unsharp masking to enhance edges
186
+ gaussian = cv2.GaussianBlur(denoised, (0, 0), 2.0)
187
+ unsharp_mask = cv2.addWeighted(denoised, 1.5, gaussian, -0.5, 0)
188
+
189
+ # Apply adaptive thresholding
190
+ thresh = cv2.adaptiveThreshold(unsharp_mask, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
191
+
192
+ # Apply morphological operations to clean up
193
+ kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))
194
+ cleaned = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
195
+
196
+ return cleaned
197
+
198
+ except Exception as e:
199
+ print(f"Error enhancing image for tiny fonts: {str(e)}")
200
+ return image
201
+
202
+ def extract_text_from_pdf(self, pdf_path):
203
+ """Extract text from PDF with multi-color text detection."""
204
+ try:
205
+ # Try to extract embedded text first
206
+ embedded_text = ""
207
+ try:
208
+ import fitz # PyMuPDF
209
+ doc = fitz.open(pdf_path)
210
+ all_text = []
211
+ any_text = False
212
+ for i, page in enumerate(doc):
213
+ t = page.get_text()
214
+ any_text |= bool(t.strip())
215
+ all_text.append({"page": i+1, "text": t, "image": None})
216
+ doc.close()
217
+ if any_text:
218
+ # render images for color diff/barcode only when needed
219
+ images = convert_from_path(pdf_path, dpi=600)
220
+ for d, im in zip(all_text, images):
221
+ d["image"] = im
222
+ return all_text
223
+ except Exception:
224
+ pass
225
+
226
+ # Enhanced OCR path with multi-color text detection
227
+ print("Extracting text with multi-color detection...")
228
+ images = convert_from_path(pdf_path, dpi=600)
229
+ all_text = []
230
+
231
+ for page_num, image in enumerate(images):
232
+ opencv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
233
+
234
+ # Multi-color text extraction
235
+ combined_text = self.extract_multi_color_text(opencv_image)
236
+
237
+ all_text.append({
238
+ 'page': page_num + 1,
239
+ 'text': combined_text,
240
+ 'image': image
241
+ })
242
+
243
+ return all_text
244
+
245
+ except Exception as e:
246
+ raise Exception(f"Error extracting text from PDF: {str(e)}")
247
+
248
+ def extract_multi_color_text(self, image):
249
+ """Extract text from image in various colors using multiple preprocessing methods."""
250
+ try:
251
+ combined_text = ""
252
+
253
+ # Method 1: Standard black text detection
254
+ print("Method 1: Standard black text detection")
255
+ processed_image = self.enhance_image_for_tiny_fonts(image)
256
+ text1 = self.ocr_with_multiple_configs(processed_image)
257
+ combined_text += text1 + " "
258
+
259
+ # Method 2: Inverted text detection (for white text on dark background)
260
+ print("Method 2: Inverted text detection")
261
+ inverted_image = self.create_inverted_image(image)
262
+ text2 = self.ocr_with_multiple_configs(inverted_image)
263
+ combined_text += text2 + " "
264
+
265
+ # Method 3: Color channel separation for colored text
266
+ print("Method 3: Color channel separation")
267
+ for channel_name, channel_image in self.extract_color_channels(image):
268
+ text3 = self.ocr_with_multiple_configs(channel_image)
269
+ combined_text += text3 + " "
270
+
271
+ # Method 4: Edge-based text detection
272
+ print("Method 4: Edge-based text detection")
273
+ edge_image = self.create_edge_enhanced_image(image)
274
+ text4 = self.ocr_with_multiple_configs(edge_image)
275
+ combined_text += text4 + " "
276
+
277
+ return combined_text.strip()
278
+
279
+ except Exception as e:
280
+ print(f"Error in multi-color text extraction: {str(e)}")
281
+ return ""
282
+
283
+ def create_inverted_image(self, image):
284
+ """Create inverted image for white text detection."""
285
+ try:
286
+ # Convert to grayscale
287
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
288
+
289
+ # Invert the image
290
+ inverted = cv2.bitwise_not(gray)
291
+
292
+ # Apply CLAHE for better contrast
293
+ clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8,8))
294
+ enhanced = clahe.apply(inverted)
295
+
296
+ # Apply thresholding
297
+ _, thresh = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
298
+
299
+ return thresh
300
+
301
+ except Exception as e:
302
+ print(f"Error creating inverted image: {str(e)}")
303
+ return image
304
+
305
+ def extract_color_channels(self, image):
306
+ """Extract individual color channels for colored text detection."""
307
+ try:
308
+ channels = []
309
+
310
+ # Convert to different color spaces
311
+ hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
312
+ lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
313
+
314
+ # Extract individual channels
315
+ b, g, r = cv2.split(image)
316
+ h, s, v = cv2.split(hsv)
317
+ l, a, b_lab = cv2.split(lab)
318
+
319
+ # Create channel images for OCR
320
+ channel_images = [
321
+ ("blue", b),
322
+ ("green", g),
323
+ ("red", r),
324
+ ("hue", h),
325
+ ("saturation", s),
326
+ ("value", v),
327
+ ("lightness", l)
328
+ ]
329
+
330
+ for name, channel in channel_images:
331
+ # Apply thresholding to each channel
332
+ _, thresh = cv2.threshold(channel, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
333
+ channels.append((name, thresh))
334
+
335
+ return channels
336
+
337
+ except Exception as e:
338
+ print(f"Error extracting color channels: {str(e)}")
339
+ return []
340
+
341
+ def create_edge_enhanced_image(self, image):
342
+ """Create edge-enhanced image for text detection."""
343
+ try:
344
+ # Convert to grayscale
345
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
346
+
347
+ # Apply edge detection
348
+ edges = cv2.Canny(gray, 50, 150)
349
+
350
+ # Dilate edges to connect text components
351
+ kernel = np.ones((2, 2), np.uint8)
352
+ dilated = cv2.dilate(edges, kernel, iterations=1)
353
+
354
+ # Invert to get white text on black background
355
+ inverted = cv2.bitwise_not(dilated)
356
+
357
+ return inverted
358
+
359
+ except Exception as e:
360
+ print(f"Error creating edge-enhanced image: {str(e)}")
361
+ return image
362
+
363
+ def ocr_with_multiple_configs(self, image):
364
+ """Perform OCR with multiple configurations."""
365
+ try:
366
+ ocr_configs = [
367
+ '--oem 3 --psm 6', # Assume uniform block of text
368
+ '--oem 3 --psm 8', # Single word
369
+ '--oem 3 --psm 13', # Raw line
370
+ '--oem 1 --psm 6', # Legacy engine
371
+ ]
372
+
373
+ best_text = ""
374
+ for config in ocr_configs:
375
+ try:
376
+ text = pytesseract.image_to_string(image, config=config)
377
+ if len(text.strip()) > len(best_text.strip()):
378
+ best_text = text
379
+ except Exception as ocr_error:
380
+ print(f"OCR error with config {config}: {str(ocr_error)}")
381
+ continue
382
+
383
+ return best_text
384
+
385
+ except Exception as e:
386
+ print(f"Error in OCR with multiple configs: {str(e)}")
387
+ return ""
388
+
389
+ def annotate_spelling_errors_on_image(self, pil_image, misspelled):
390
+ """
391
+ Draw one red rectangle around each misspelled token using Tesseract word boxes.
392
+ 'misspelled' must be a list of dicts with 'word' keys (from check_spelling).
393
+ """
394
+ if not misspelled:
395
+ return pil_image
396
+
397
+ def _norm(s: str) -> str:
398
+ return unicodedata.normalize("NFKC", s).replace("\u2019", "'").strip(".,:;!?)(").lower()
399
+
400
+ # build a quick lookup of misspelled lowercase words
401
+ miss_set = {_norm(m["word"]) for m in misspelled}
402
+
403
+ # run word-level OCR to get boxes
404
+ img = pil_image
405
+ try:
406
+ data = pytesseract.image_to_data(
407
+ img,
408
+ lang="eng+fra",
409
+ config="--oem 3 --psm 6",
410
+ output_type=pytesseract.Output.DICT,
411
+ )
412
+ except Exception as e:
413
+ print("image_to_data failed:", e)
414
+ return img
415
+
416
+ draw = ImageDraw.Draw(img)
417
+ n = len(data.get("text", []))
418
+ for i in range(n):
419
+ word = (data["text"][i] or "").strip()
420
+ if not word:
421
+ continue
422
+ clean = _norm(word)
423
+
424
+ if clean and clean in miss_set:
425
+ x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
426
+ # draw a distinct box for this one word
427
+ draw.rectangle([x, y, x + w, y + h], outline="red", width=4)
428
+
429
+ return img
430
+
431
+ def detect_barcodes_qr_codes(self, image):
432
+ """Detect and decode barcodes and QR codes with timeout protection"""
433
+ try:
434
+ print("Starting barcode detection...")
435
+ start_time = time.time()
436
+
437
+ # Convert PIL image to OpenCV format
438
+ opencv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
439
+
440
+ all_barcodes = []
441
+
442
+ # Method 1: Basic pyzbar detection (fastest)
443
+ print("Method 1: Basic pyzbar detection")
444
+ pyzbar_results = self.detect_with_pyzbar_basic(opencv_image)
445
+ if pyzbar_results:
446
+ all_barcodes.extend(pyzbar_results)
447
+ print(f"Found {len(pyzbar_results)} barcodes with basic pyzbar")
448
+
449
+ # Method 2: Dynamsoft Barcode Reader (if available)
450
+ if DBR_AVAILABLE:
451
+ print("Method 2: Dynamsoft Barcode Reader")
452
+ dbr_results = self.detect_with_dynamsoft(opencv_image)
453
+ if dbr_results:
454
+ all_barcodes.extend(dbr_results)
455
+ print(f"Found {len(dbr_results)} barcodes with Dynamsoft")
456
+
457
+ # Method 3: Enhanced preprocessing (always run for better detection)
458
+ print("Method 3: Enhanced preprocessing")
459
+ enhanced_results = self.detect_with_enhanced_preprocessing(opencv_image)
460
+ if enhanced_results:
461
+ all_barcodes.extend(enhanced_results)
462
+ print(f"Found {len(enhanced_results)} additional barcodes with enhanced preprocessing")
463
+
464
+ # Method 4: Small barcode detection (always run for better detection)
465
+ print("Method 4: Small barcode detection")
466
+ small_results = self.detect_small_barcodes_simple(opencv_image)
467
+ if small_results:
468
+ all_barcodes.extend(small_results)
469
+ print(f"Found {len(small_results)} additional small barcodes")
470
+
471
+ # Remove duplicates
472
+ unique_barcodes = self.remove_duplicate_barcodes(all_barcodes)
473
+
474
+ # Enhance results
475
+ enhanced_barcodes = self.enhance_barcode_data(unique_barcodes)
476
+
477
+ elapsed_time = time.time() - start_time
478
+ print(f"Barcode detection completed in {elapsed_time:.2f} seconds. Found {len(enhanced_barcodes)} unique barcodes.")
479
+
480
+ return enhanced_barcodes
481
+
482
+ except Exception as e:
483
+ print(f"Error in barcode detection: {str(e)}")
484
+ return []
485
+
486
+ def detect_with_pyzbar_basic(self, image):
487
+ """Basic pyzbar detection without complex preprocessing"""
488
+ results = []
489
+
490
+ try:
491
+ # Simple grayscale conversion
492
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
493
+
494
+ # Try original image
495
+ decoded_objects = decode(gray)
496
+ for obj in decoded_objects:
497
+ barcode_info = {
498
+ 'type': obj.type,
499
+ 'data': obj.data.decode('utf-8', errors='ignore'),
500
+ 'rect': obj.rect,
501
+ 'polygon': obj.polygon,
502
+ 'quality': getattr(obj, 'quality', 0),
503
+ 'orientation': self.detect_barcode_orientation(obj),
504
+ 'method': 'pyzbar_basic'
505
+ }
506
+
507
+ if 'databar' in obj.type.lower():
508
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
509
+
510
+ results.append(barcode_info)
511
+
512
+ # Try with simple contrast enhancement
513
+ clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
514
+ enhanced = clahe.apply(gray)
515
+ decoded_objects = decode(enhanced)
516
+
517
+ for obj in decoded_objects:
518
+ barcode_info = {
519
+ 'type': obj.type,
520
+ 'data': obj.data.decode('utf-8', errors='ignore'),
521
+ 'rect': obj.rect,
522
+ 'polygon': obj.polygon,
523
+ 'quality': getattr(obj, 'quality', 0),
524
+ 'orientation': self.detect_barcode_orientation(obj),
525
+ 'method': 'pyzbar_enhanced'
526
+ }
527
+
528
+ if 'databar' in obj.type.lower():
529
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
530
+
531
+ results.append(barcode_info)
532
+
533
+ except Exception as e:
534
+ print(f"Error in basic pyzbar detection: {str(e)}")
535
+
536
+ return results
537
+
538
+ def detect_with_dynamsoft(self, image):
539
+ """Detect barcodes using Dynamsoft Barcode Reader"""
540
+ results = []
541
+
542
+ try:
543
+ if not DBR_AVAILABLE:
544
+ return results
545
+
546
+ # Initialize Dynamsoft Barcode Reader
547
+ reader = BarcodeReader()
548
+
549
+ # Convert OpenCV image to bytes for Dynamsoft
550
+ success, buffer = cv2.imencode('.png', image)
551
+ if not success:
552
+ print("Failed to encode image for Dynamsoft")
553
+ return results
554
+
555
+ image_bytes = buffer.tobytes()
556
+
557
+ # Decode barcodes
558
+ text_results = reader.decode_file_stream(image_bytes)
559
+
560
+ for result in text_results:
561
+ barcode_info = {
562
+ 'type': result.barcode_format_string,
563
+ 'data': result.barcode_text,
564
+ 'rect': type('Rect', (), {
565
+ 'left': result.localization_result.x1,
566
+ 'top': result.localization_result.y1,
567
+ 'width': result.localization_result.x2 - result.localization_result.x1,
568
+ 'height': result.localization_result.y2 - result.localization_result.y1
569
+ })(),
570
+ 'polygon': [
571
+ (result.localization_result.x1, result.localization_result.y1),
572
+ (result.localization_result.x2, result.localization_result.y1),
573
+ (result.localization_result.x2, result.localization_result.y2),
574
+ (result.localization_result.x1, result.localization_result.y2)
575
+ ],
576
+ 'quality': result.confidence,
577
+ 'orientation': self.detect_barcode_orientation(result),
578
+ 'method': 'dynamsoft'
579
+ }
580
+
581
+ # Enhanced DataBar Expanded detection
582
+ if 'databar' in result.barcode_format_string.lower() or 'expanded' in result.barcode_format_string.lower():
583
+ barcode_info['expanded_data'] = self.parse_databar_expanded(result.barcode_text)
584
+
585
+ results.append(barcode_info)
586
+
587
+ print(f"Dynamsoft detected {len(results)} barcodes")
588
+
589
+ except Exception as e:
590
+ print(f"Error in Dynamsoft detection: {str(e)}")
591
+
592
+ return results
593
+
594
+ def detect_with_enhanced_preprocessing(self, image):
595
+ """Enhanced preprocessing with limited methods"""
596
+ results = []
597
+
598
+ try:
599
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
600
+
601
+ # Limited preprocessing methods
602
+ processed_images = [
603
+ gray, # Original
604
+ cv2.resize(gray, (gray.shape[1] * 3, gray.shape[0] * 3), interpolation=cv2.INTER_CUBIC), # 3x scale
605
+ cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2), # Adaptive threshold
606
+ ]
607
+
608
+ for i, processed_image in enumerate(processed_images):
609
+ try:
610
+ decoded_objects = decode(processed_image)
611
+
612
+ for obj in decoded_objects:
613
+ barcode_info = {
614
+ 'type': obj.type,
615
+ 'data': obj.data.decode('utf-8', errors='ignore'),
616
+ 'rect': obj.rect,
617
+ 'polygon': obj.polygon,
618
+ 'quality': getattr(obj, 'quality', 0),
619
+ 'orientation': self.detect_barcode_orientation(obj),
620
+ 'method': f'enhanced_preprocessing_{i}'
621
+ }
622
+
623
+ if 'databar' in obj.type.lower():
624
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
625
+
626
+ results.append(barcode_info)
627
+
628
+ except Exception as e:
629
+ print(f"Error in enhanced preprocessing method {i}: {str(e)}")
630
+ continue
631
+
632
+ except Exception as e:
633
+ print(f"Error in enhanced preprocessing: {str(e)}")
634
+
635
+ return results
636
+
637
+ def detect_small_barcodes_simple(self, image):
638
+ """Simplified small barcode detection"""
639
+ results = []
640
+
641
+ try:
642
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
643
+
644
+ # Only try 3x and 4x scaling
645
+ scale_factors = [3.0, 4.0]
646
+
647
+ for scale in scale_factors:
648
+ try:
649
+ height, width = gray.shape
650
+ new_height, new_width = int(height * scale), int(width * scale)
651
+ scaled = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
652
+
653
+ decoded_objects = decode(scaled)
654
+
655
+ for obj in decoded_objects:
656
+ # Scale back coordinates
657
+ scale_factor = width / new_width
658
+ scaled_rect = type('Rect', (), {
659
+ 'left': int(obj.rect.left * scale_factor),
660
+ 'top': int(obj.rect.top * scale_factor),
661
+ 'width': int(obj.rect.width * scale_factor),
662
+ 'height': int(obj.rect.height * scale_factor)
663
+ })()
664
+
665
+ barcode_info = {
666
+ 'type': obj.type,
667
+ 'data': obj.data.decode('utf-8', errors='ignore'),
668
+ 'rect': scaled_rect,
669
+ 'polygon': obj.polygon,
670
+ 'quality': getattr(obj, 'quality', 0),
671
+ 'orientation': self.detect_barcode_orientation(obj),
672
+ 'method': f'small_barcode_{scale}x',
673
+ 'size_category': 'small'
674
+ }
675
+
676
+ if 'databar' in obj.type.lower():
677
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
678
+
679
+ results.append(barcode_info)
680
+
681
+ except Exception as e:
682
+ print(f"Error in small barcode detection at {scale}x: {str(e)}")
683
+ continue
684
+
685
+ except Exception as e:
686
+ print(f"Error in small barcode detection: {str(e)}")
687
+
688
+ return results
689
+
690
+ def preprocess_image_for_ocr(self, image):
691
+ """Preprocess image for better OCR results"""
692
+ try:
693
+ # Convert to grayscale
694
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
695
+
696
+ # Apply different preprocessing techniques
697
+
698
+ # 1. Resize image to improve small text recognition
699
+ height, width = gray.shape
700
+ scale_factor = 3.0 # Scale up for better small font recognition
701
+ new_height, new_width = int(height * scale_factor), int(width * scale_factor)
702
+ resized = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
703
+
704
+ # 2. Apply Gaussian blur to reduce noise
705
+ blurred = cv2.GaussianBlur(resized, (1, 1), 0)
706
+
707
+ # 3. Apply adaptive thresholding for better text separation
708
+ thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
709
+
710
+ # 4. Apply morphological operations to clean up text
711
+ kernel = np.ones((1, 1), np.uint8)
712
+ cleaned = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
713
+
714
+ # 5. Apply contrast enhancement
715
+ clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
716
+ enhanced = clahe.apply(cleaned)
717
+
718
+ return enhanced
719
+
720
+ except Exception as e:
721
+ print(f"Error preprocessing image: {str(e)}")
722
+ return image # Return original if preprocessing fails
723
+
724
+ def preprocess_for_barcode_detection(self, image):
725
+ """Preprocess image with multiple techniques for better barcode detection"""
726
+ processed_images = [image] # Start with original
727
+
728
+ try:
729
+ # Convert to grayscale
730
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
731
+ processed_images.append(gray)
732
+
733
+ # Apply different preprocessing techniques
734
+
735
+ # 1. Contrast enhancement
736
+ clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
737
+ enhanced = clahe.apply(gray)
738
+ processed_images.append(enhanced)
739
+
740
+ # 2. Gaussian blur for noise reduction
741
+ blurred = cv2.GaussianBlur(gray, (3, 3), 0)
742
+ processed_images.append(blurred)
743
+
744
+ # 3. Adaptive thresholding
745
+ thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
746
+ processed_images.append(thresh)
747
+
748
+ # 4. Edge enhancement for better barcode detection
749
+ kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
750
+ sharpened = cv2.filter2D(gray, -1, kernel)
751
+ processed_images.append(sharpened)
752
+
753
+ # 5. Scale up for small barcodes
754
+ height, width = gray.shape
755
+ scale_factor = 3.0
756
+ new_height, new_width = int(height * scale_factor), int(width * scale_factor)
757
+ scaled = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
758
+ processed_images.append(scaled)
759
+
760
+ except Exception as e:
761
+ print(f"Error in barcode preprocessing: {str(e)}")
762
+
763
+ return processed_images
764
+
765
+ def preprocess_for_databar(self, gray_image):
766
+ """Specialized preprocessing for DataBar Expanded Stacked barcodes"""
767
+ processed_images = []
768
+
769
+ try:
770
+ # Original grayscale
771
+ processed_images.append(gray_image)
772
+
773
+ # 1. High contrast enhancement for DataBar
774
+ clahe = cv2.createCLAHE(clipLimit=4.0, tileGridSize=(8, 8))
775
+ enhanced = clahe.apply(gray_image)
776
+ processed_images.append(enhanced)
777
+
778
+ # 2. Bilateral filter to preserve edges while reducing noise
779
+ bilateral = cv2.bilateralFilter(gray_image, 9, 75, 75)
780
+ processed_images.append(bilateral)
781
+
782
+ # 3. Adaptive thresholding with different parameters
783
+ thresh1 = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, 2)
784
+ processed_images.append(thresh1)
785
+
786
+ thresh2 = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
787
+ processed_images.append(thresh2)
788
+
789
+ # 4. Scale up for better DataBar detection
790
+ height, width = gray_image.shape
791
+ scale_factors = [2.0, 3.0, 4.0]
792
+
793
+ for scale in scale_factors:
794
+ new_height, new_width = int(height * scale), int(width * scale)
795
+ scaled = cv2.resize(gray_image, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
796
+ processed_images.append(scaled)
797
+
798
+ # 5. Edge enhancement specifically for DataBar
799
+ kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
800
+ sharpened = cv2.filter2D(gray_image, -1, kernel)
801
+ processed_images.append(sharpened)
802
+
803
+ # 6. Morphological operations for DataBar
804
+ kernel = np.ones((2, 2), np.uint8)
805
+ morphed = cv2.morphologyEx(gray_image, cv2.MORPH_CLOSE, kernel)
806
+ processed_images.append(morphed)
807
+
808
+ except Exception as e:
809
+ print(f"Error in DataBar preprocessing: {str(e)}")
810
+
811
+ return processed_images
812
+
813
+ def detect_with_transformations(self, image):
814
+ """Detect barcodes using multiple image transformations"""
815
+ results = []
816
+
817
+ try:
818
+ # Try different rotations
819
+ angles = [0, 90, 180, 270]
820
+
821
+ for angle in angles:
822
+ if angle == 0:
823
+ rotated_image = image
824
+ else:
825
+ height, width = image.shape[:2]
826
+ center = (width // 2, height // 2)
827
+ rotation_matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
828
+ rotated_image = cv2.warpAffine(image, rotation_matrix, (width, height))
829
+
830
+ # Try to detect barcodes in rotated image
831
+ try:
832
+ decoded_objects = decode(rotated_image)
833
+
834
+ for obj in decoded_objects:
835
+ barcode_info = {
836
+ 'type': obj.type,
837
+ 'data': obj.data.decode('utf-8', errors='ignore'),
838
+ 'rect': obj.rect,
839
+ 'polygon': obj.polygon,
840
+ 'quality': getattr(obj, 'quality', 0),
841
+ 'orientation': f"{angle}°",
842
+ 'method': f'transform_{angle}deg'
843
+ }
844
+
845
+ # Enhanced DataBar Expanded detection
846
+ if 'databar' in obj.type.lower() or 'expanded' in obj.type.lower():
847
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
848
+
849
+ # Check for multi-stack barcodes
850
+ if self.is_multi_stack_barcode(obj, rotated_image):
851
+ barcode_info['stack_type'] = self.detect_stack_type(obj, rotated_image)
852
+
853
+ results.append(barcode_info)
854
+
855
+ except Exception as e:
856
+ print(f"Error in transformation detection at {angle}°: {str(e)}")
857
+ continue
858
+
859
+ except Exception as e:
860
+ print(f"Error in transformation detection: {str(e)}")
861
+
862
+ return results
863
+
864
+ def detect_small_barcodes(self, image):
865
+ """Specialized detection for small barcodes and QR codes"""
866
+ results = []
867
+
868
+ try:
869
+ # Convert to grayscale
870
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
871
+
872
+ # Apply specialized preprocessing for small barcodes
873
+ processed_images = self.preprocess_for_small_barcodes(gray)
874
+
875
+ for processed_image in processed_images:
876
+ try:
877
+ decoded_objects = decode(processed_image)
878
+
879
+ for obj in decoded_objects:
880
+ # Check if this is a small barcode (less than 50x50 pixels)
881
+ if obj.rect.width < 50 or obj.rect.height < 50:
882
+ barcode_info = {
883
+ 'type': obj.type,
884
+ 'data': obj.data.decode('utf-8', errors='ignore'),
885
+ 'rect': obj.rect,
886
+ 'polygon': obj.polygon,
887
+ 'quality': getattr(obj, 'quality', 0),
888
+ 'orientation': self.detect_barcode_orientation(obj),
889
+ 'method': 'small_barcode_detection',
890
+ 'size_category': 'small'
891
+ }
892
+
893
+ # Enhanced DataBar Expanded detection
894
+ if 'databar' in obj.type.lower() or 'expanded' in obj.type.lower():
895
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
896
+
897
+ # Check for multi-stack barcodes
898
+ if self.is_multi_stack_barcode(obj, image):
899
+ barcode_info['stack_type'] = self.detect_stack_type(obj, image)
900
+
901
+ results.append(barcode_info)
902
+
903
+ except Exception as e:
904
+ print(f"Error in small barcode detection: {str(e)}")
905
+ continue
906
+
907
+ except Exception as e:
908
+ print(f"Error in small barcode preprocessing: {str(e)}")
909
+
910
+ return results
911
+
912
+ def preprocess_for_small_barcodes(self, gray_image):
913
+ """Specialized preprocessing for small barcodes and QR codes"""
914
+ processed_images = []
915
+
916
+ try:
917
+ # Original grayscale
918
+ processed_images.append(gray_image)
919
+
920
+ # 1. Multiple high-resolution scaling for small barcodes
921
+ height, width = gray_image.shape
922
+ scale_factors = [4.0, 5.0, 6.0, 8.0] # Higher scaling for small barcodes
923
+
924
+ for scale in scale_factors:
925
+ new_height, new_width = int(height * scale), int(width * scale)
926
+ scaled = cv2.resize(gray_image, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
927
+ processed_images.append(scaled)
928
+
929
+ # 2. Aggressive contrast enhancement
930
+ clahe = cv2.createCLAHE(clipLimit=5.0, tileGridSize=(8, 8))
931
+ enhanced = clahe.apply(gray_image)
932
+ processed_images.append(enhanced)
933
+
934
+ # 3. Unsharp masking for edge enhancement
935
+ gaussian = cv2.GaussianBlur(gray_image, (0, 0), 2.0)
936
+ unsharp = cv2.addWeighted(gray_image, 1.5, gaussian, -0.5, 0)
937
+ processed_images.append(unsharp)
938
+
939
+ # 4. Multiple thresholding methods
940
+ # Otsu's thresholding
941
+ _, otsu = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
942
+ processed_images.append(otsu)
943
+
944
+ # Adaptive thresholding with different parameters
945
+ adaptive1 = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 9, 2)
946
+ processed_images.append(adaptive1)
947
+
948
+ adaptive2 = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 7, 2)
949
+ processed_images.append(adaptive2)
950
+
951
+ # 5. Noise reduction with different methods
952
+ # Bilateral filter
953
+ bilateral = cv2.bilateralFilter(gray_image, 9, 75, 75)
954
+ processed_images.append(bilateral)
955
+
956
+ # Median filter
957
+ median = cv2.medianBlur(gray_image, 3)
958
+ processed_images.append(median)
959
+
960
+ # 6. Edge detection and enhancement
961
+ # Sobel edge detection
962
+ sobel_x = cv2.Sobel(gray_image, cv2.CV_64F, 1, 0, ksize=3)
963
+ sobel_y = cv2.Sobel(gray_image, cv2.CV_64F, 0, 1, ksize=3)
964
+ sobel = np.sqrt(sobel_x**2 + sobel_y**2)
965
+ sobel = np.uint8(sobel * 255 / sobel.max())
966
+ processed_images.append(sobel)
967
+
968
+ # 7. Morphological operations for small barcode cleanup
969
+ kernel = np.ones((2, 2), np.uint8)
970
+ morphed_close = cv2.morphologyEx(gray_image, cv2.MORPH_CLOSE, kernel)
971
+ processed_images.append(morphed_close)
972
+
973
+ kernel_open = np.ones((1, 1), np.uint8)
974
+ morphed_open = cv2.morphologyEx(gray_image, cv2.MORPH_OPEN, kernel_open)
975
+ processed_images.append(morphed_open)
976
+
977
+ except Exception as e:
978
+ print(f"Error in small barcode preprocessing: {str(e)}")
979
+
980
+ return processed_images
981
+
982
+ def detect_with_high_resolution(self, image):
983
+ """Detect barcodes using high-resolution processing"""
984
+ results = []
985
+
986
+ try:
987
+ # Convert to grayscale
988
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
989
+
990
+ # Process at multiple high resolutions
991
+ height, width = gray.shape
992
+ resolutions = [
993
+ (int(width * 3), int(height * 3)), # 3x resolution
994
+ (int(width * 4), int(height * 4)), # 4x resolution
995
+ (int(width * 6), int(height * 6)) # 6x resolution
996
+ ]
997
+
998
+ for new_width, new_height in resolutions:
999
+ try:
1000
+ # Resize with high-quality interpolation
1001
+ resized = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
1002
+
1003
+ # Apply high-resolution preprocessing
1004
+ processed = self.preprocess_high_resolution(resized)
1005
+
1006
+ # Try to detect barcodes
1007
+ decoded_objects = decode(processed)
1008
+
1009
+ for obj in decoded_objects:
1010
+ # Scale back the coordinates to original image size
1011
+ scale_factor = width / new_width
1012
+ scaled_rect = type('Rect', (), {
1013
+ 'left': int(obj.rect.left * scale_factor),
1014
+ 'top': int(obj.rect.top * scale_factor),
1015
+ 'width': int(obj.rect.width * scale_factor),
1016
+ 'height': int(obj.rect.height * scale_factor)
1017
+ })()
1018
+
1019
+ barcode_info = {
1020
+ 'type': obj.type,
1021
+ 'data': obj.data.decode('utf-8', errors='ignore'),
1022
+ 'rect': scaled_rect,
1023
+ 'polygon': obj.polygon,
1024
+ 'quality': getattr(obj, 'quality', 0),
1025
+ 'orientation': self.detect_barcode_orientation(obj),
1026
+ 'method': f'high_res_{new_width}x{new_height}',
1027
+ 'resolution': f'{new_width}x{new_height}'
1028
+ }
1029
+
1030
+ # Enhanced DataBar Expanded detection
1031
+ if 'databar' in obj.type.lower() or 'expanded' in obj.type.lower():
1032
+ barcode_info['expanded_data'] = self.parse_databar_expanded(obj.data.decode('utf-8', errors='ignore'))
1033
+
1034
+ # Check for multi-stack barcodes
1035
+ if self.is_multi_stack_barcode(obj, image):
1036
+ barcode_info['stack_type'] = self.detect_stack_type(obj, image)
1037
+
1038
+ results.append(barcode_info)
1039
+
1040
+ except Exception as e:
1041
+ print(f"Error in high-resolution detection at {new_width}x{new_height}: {str(e)}")
1042
+ continue
1043
+
1044
+ except Exception as e:
1045
+ print(f"Error in high-resolution detection: {str(e)}")
1046
+
1047
+ return results
1048
+
1049
+ def preprocess_high_resolution(self, image):
1050
+ """Preprocessing optimized for high-resolution images"""
1051
+ try:
1052
+ # 1. High-quality noise reduction
1053
+ denoised = cv2.fastNlMeansDenoising(image)
1054
+
1055
+ # 2. Advanced contrast enhancement
1056
+ clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
1057
+ enhanced = clahe.apply(denoised)
1058
+
1059
+ # 3. Edge-preserving smoothing
1060
+ bilateral = cv2.bilateralFilter(enhanced, 9, 75, 75)
1061
+
1062
+ # 4. Sharpening
1063
+ kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
1064
+ sharpened = cv2.filter2D(bilateral, -1, kernel)
1065
+
1066
+ # 5. Adaptive thresholding for high-res
1067
+ thresh = cv2.adaptiveThreshold(sharpened, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
1068
+
1069
+ return thresh
1070
+
1071
+ except Exception as e:
1072
+ print(f"Error in high-resolution preprocessing: {str(e)}")
1073
+ return image
1074
+
1075
+ def detect_barcode_orientation(self, barcode_obj):
1076
+ """Detect the orientation of the barcode"""
1077
+ try:
1078
+ if hasattr(barcode_obj, 'polygon') and len(barcode_obj.polygon) >= 4:
1079
+ # Calculate orientation based on polygon points
1080
+ points = np.array(barcode_obj.polygon)
1081
+ # Calculate the angle of the longest edge
1082
+ edges = []
1083
+ for i in range(4):
1084
+ p1 = points[i]
1085
+ p2 = points[(i + 1) % 4]
1086
+ edge_length = np.linalg.norm(p2 - p1)
1087
+ angle = np.arctan2(p2[1] - p1[1], p2[0] - p1[0]) * 180 / np.pi
1088
+ edges.append((edge_length, angle))
1089
+
1090
+ # Find the longest edge (likely the main barcode direction)
1091
+ longest_edge = max(edges, key=lambda x: x[0])
1092
+ return f"{longest_edge[1]:.1f}°"
1093
+
1094
+ return "Unknown"
1095
+ except:
1096
+ return "Unknown"
1097
+
1098
+ def parse_databar_expanded(self, data):
1099
+ """Parse DataBar Expanded barcode data"""
1100
+ try:
1101
+ # DataBar Expanded can contain multiple data fields
1102
+ # Format: [01]12345678901234[3101]123[3102]456
1103
+ parsed_data = {}
1104
+
1105
+ # Extract GS1 Application Identifiers
1106
+ ai_pattern = r'\[(\d{2,4})\]([^\[]+)'
1107
+ matches = re.findall(ai_pattern, data)
1108
+
1109
+ for ai, value in matches:
1110
+ parsed_data[f"AI {ai}"] = value
1111
+
1112
+ # If no AI pattern found, return original data
1113
+ if not parsed_data:
1114
+ parsed_data["Raw Data"] = data
1115
+
1116
+ return parsed_data
1117
+
1118
+ except Exception as e:
1119
+ return {"Raw Data": data, "Parse Error": str(e)}
1120
+
1121
+ def is_multi_stack_barcode(self, barcode_obj, image):
1122
+ """Detect if this is a multi-stack barcode"""
1123
+ try:
1124
+ if hasattr(barcode_obj, 'rect'):
1125
+ x, y, w, h = barcode_obj.rect
1126
+
1127
+ # Check if the barcode is unusually tall (indicating stacked format)
1128
+ aspect_ratio = h / w if w > 0 else 0
1129
+
1130
+ # DataBar Expanded and other stacked barcodes typically have aspect ratios > 0.3
1131
+ return aspect_ratio > 0.3
1132
+
1133
+ except:
1134
+ pass
1135
+
1136
+ return False
1137
+
1138
+ def detect_stack_type(self, barcode_obj, image):
1139
+ """Detect the type of multi-stack barcode"""
1140
+ try:
1141
+ if hasattr(barcode_obj, 'rect'):
1142
+ x, y, w, h = barcode_obj.rect
1143
+ aspect_ratio = h / w if w > 0 else 0
1144
+
1145
+ # Classify based on aspect ratio and barcode type
1146
+ if 'databar' in barcode_obj.type.lower():
1147
+ if aspect_ratio > 0.5:
1148
+ return "Quad Stack"
1149
+ elif aspect_ratio > 0.35:
1150
+ return "Triple Stack"
1151
+ elif aspect_ratio > 0.25:
1152
+ return "Double Stack"
1153
+ else:
1154
+ return "Single Stack"
1155
+ else:
1156
+ # For other barcode types
1157
+ if aspect_ratio > 0.4:
1158
+ return "Multi-Stack"
1159
+ else:
1160
+ return "Single Stack"
1161
+
1162
+ except:
1163
+ pass
1164
+
1165
+ return "Unknown"
1166
+
1167
+ def remove_duplicate_barcodes(self, barcodes):
1168
+ """Remove duplicate barcodes based on position and data"""
1169
+ unique_barcodes = []
1170
+ seen_positions = set()
1171
+ seen_data = set()
1172
+
1173
+ for barcode in barcodes:
1174
+ # Create position signature
1175
+ pos_signature = f"{barcode['rect'].left},{barcode['rect'].top},{barcode['rect'].width},{barcode['rect'].height}"
1176
+ data_signature = barcode['data']
1177
+
1178
+ # Check if we've seen this position or data before
1179
+ if pos_signature not in seen_positions and data_signature not in seen_data:
1180
+ unique_barcodes.append(barcode)
1181
+ seen_positions.add(pos_signature)
1182
+ seen_data.add(data_signature)
1183
+
1184
+ return unique_barcodes
1185
+
1186
+ def enhance_barcode_data(self, barcodes):
1187
+ """Enhance barcode data with additional analysis"""
1188
+ enhanced_barcodes = []
1189
+
1190
+ for barcode in barcodes:
1191
+ # Add confidence score based on method and quality
1192
+ confidence = self.calculate_confidence(barcode)
1193
+ barcode['confidence'] = confidence
1194
+
1195
+ # Add GS1 validation for DataBar
1196
+ if 'databar' in barcode['type'].lower():
1197
+ barcode['gs1_validated'] = self.validate_gs1_format(barcode['data'])
1198
+
1199
+ enhanced_barcodes.append(barcode)
1200
+
1201
+ return enhanced_barcodes
1202
+
1203
+ def calculate_confidence(self, barcode):
1204
+ """Calculate confidence score for barcode detection"""
1205
+ confidence = 50 # Base confidence
1206
+
1207
+ # Method confidence
1208
+ method_scores = {
1209
+ 'pyzbar_basic': 70,
1210
+ 'pyzbar_enhanced': 70,
1211
+ 'dynamsoft': 85, # Dynamsoft typically has higher accuracy
1212
+ 'enhanced_preprocessing_0': 65,
1213
+ 'enhanced_preprocessing_1': 60,
1214
+ 'enhanced_preprocessing_2': 55,
1215
+ 'transform_0deg': 60,
1216
+ 'transform_90deg': 50,
1217
+ 'transform_180deg': 50,
1218
+ 'transform_270deg': 50,
1219
+ 'small_barcode_detection': 75,
1220
+ 'high_res_2x': 70,
1221
+ 'high_res_3x': 65,
1222
+ 'high_res_4x': 60
1223
+ }
1224
+
1225
+ if barcode.get('method') in method_scores:
1226
+ confidence += method_scores[barcode['method']]
1227
+
1228
+ # Quality score
1229
+ if barcode.get('quality', 0) > 0:
1230
+ confidence += min(barcode['quality'], 20)
1231
+
1232
+ # DataBar specific confidence
1233
+ if 'databar' in barcode['type'].lower():
1234
+ confidence += 10
1235
+
1236
+ return min(confidence, 100)
1237
+
1238
+ def validate_gs1_format(self, data):
1239
+ """Validate GS1 format for DataBar data"""
1240
+ try:
1241
+ # Check for GS1 Application Identifiers
1242
+ ai_pattern = r'\[(\d{2,4})\]'
1243
+ matches = re.findall(ai_pattern, data)
1244
+
1245
+ if matches:
1246
+ return True
1247
+
1248
+ # Check for parentheses format
1249
+ ai_pattern_parens = r'\((\d{2,4})\)'
1250
+ matches_parens = re.findall(ai_pattern_parens, data)
1251
+
1252
+ return len(matches_parens) > 0
1253
+
1254
+ except:
1255
+ return False
1256
+
1257
+ def check_spelling(self, text):
1258
+ """
1259
+ Robust EN/FR spell check:
1260
+ - Unicode-aware tokens (keeps accents)
1261
+ - Normalizes curly quotes/ligatures
1262
+ - Heuristic per-token language (accented => FR; else EN)
1263
+ - Flags if unknown in its likely language (not both)
1264
+ """
1265
+ try:
1266
+ # normalize ligatures & curly quotes
1267
+ text = unicodedata.normalize("NFKC", text)
1268
+ text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
1269
+
1270
+ # unicode letters with internal ' or - allowed
1271
+ tokens = _re.findall(TOKEN_PATTERN, text, flags=_re.UNICODE if _USE_REGEX else 0)
1272
+
1273
+ issues = []
1274
+ for raw in tokens:
1275
+ t = raw.lower()
1276
+
1277
+ # skip very short, short ALL-CAPS acronyms, and whitelisted terms
1278
+ if len(t) < 3:
1279
+ continue
1280
+ if raw.isupper() and len(raw) <= 3:
1281
+ continue
1282
+ if t in DOMAIN_WHITELIST:
1283
+ continue
1284
+
1285
+ miss_en = t in self.english_spellchecker.unknown([t])
1286
+ miss_fr = t in self.french_spellchecker.unknown([t])
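+ # SpellChecker.unknown() returns the subset of the given words not found in that language's dictionary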
1287
+
1288
+ use_fr = _likely_french(raw)
1289
+
1290
+ # Prefer the likely language, but fall back to "either language unknown"
1291
+ if (use_fr and miss_fr) or ((not use_fr) and miss_en) or (miss_en and miss_fr):
1292
+ issues.append({
1293
+ "word": raw,
1294
+ "lang": "fr" if use_fr else "en",
1295
+ "suggestions_en": list(self.english_spellchecker.candidates(t))[:3],
1296
+ "suggestions_fr": list(self.french_spellchecker.candidates(t))[:3],
1297
+ })
1298
+
1299
+ return issues
1300
+ except Exception as e:
1301
+ print(f"Error checking spelling: {e}")
1302
+ return []
1303
+
1304
+ def compare_colors(self, image1, image2):
1305
+ """Compare colors between two images and return differences using RGB color space"""
1306
+ try:
1307
+ print("Starting RGB color comparison...")
1308
+
1309
+ # Convert images to same size
1310
+ img1 = np.array(image1)
1311
+ img2 = np.array(image2)
1312
+
1313
+ print(f"Image 1 shape: {img1.shape}")
1314
+ print(f"Image 2 shape: {img2.shape}")
1315
+
1316
+ # Resize images to same dimensions
1317
+ height = min(img1.shape[0], img2.shape[0])
1318
+ width = min(img1.shape[1], img2.shape[1])
1319
+
1320
+ img1_resized = cv2.resize(img1, (width, height))
1321
+ img2_resized = cv2.resize(img2, (width, height))
1322
+
1323
+ print(f"Resized to: {width}x{height}")
1324
+
1325
+ # Keep images in RGB format (no conversion to BGR)
1326
+ img1_rgb = img1_resized
1327
+ img2_rgb = img2_resized
1328
+
1329
+ color_differences = []
1330
+
1331
+ # Method 1: Weighted RGB channel comparison (red/green weighted higher, Gaussian smoothing to cut noise)
1332
+ print("Method 1: Enhanced RGB channel comparison")
1333
+
1334
+ # Calculate absolute difference for each RGB channel with enhanced precision
1335
+ diff_r = cv2.absdiff(img1_rgb[:,:,0], img2_rgb[:,:,0]) # Red channel
1336
+ diff_g = cv2.absdiff(img1_rgb[:,:,1], img2_rgb[:,:,1]) # Green channel
1337
+ diff_b = cv2.absdiff(img1_rgb[:,:,2], img2_rgb[:,:,2]) # Blue channel
1338
+
1339
+ # Enhanced RGB combination with better weighting
1340
+ diff_combined = cv2.addWeighted(diff_r, 0.4, diff_g, 0.4, 0) # Red and Green weighted higher
1341
+ diff_combined = cv2.addWeighted(diff_combined, 1.0, diff_b, 0.2, 0) # Blue weighted lower
1342
+
1343
+ # Apply Gaussian blur to reduce noise and improve accuracy
1344
+ diff_combined = cv2.GaussianBlur(diff_combined, (3, 3), 0)
1345
+
1346
+ # Apply balanced thresholds to catch color variations while avoiding multiple boxes
1347
+ rgb_thresholds = [15, 22, 30, 40] # Balanced thresholds
1348
+
1349
+ for threshold in rgb_thresholds:
1350
+ _, thresh = cv2.threshold(diff_combined, threshold, 255, cv2.THRESH_BINARY)
1351
+
1352
+ # Apply minimal morphological operations
1353
+ kernel = np.ones((1, 1), np.uint8) # Minimal kernel to preserve detail
1354
+ thresh = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
1355
+ thresh = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
1356
+
1357
+ # Find contours
1358
+ contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
1359
+
1360
+ print(f"RGB Threshold {threshold}: Found {len(contours)} contours")
1361
+
1362
+ for contour in contours:
1363
+ area = cv2.contourArea(contour)
1364
+ if area > 15: # Balanced area threshold to catch variations while avoiding small boxes
1365
+ x, y, w, h = cv2.boundingRect(contour)
1366
+
1367
+ # Get the actual RGB colors at this location
1368
+ color1 = img1_rgb[y:y+h, x:x+w].mean(axis=(0, 1))
1369
+ color2 = img2_rgb[y:y+h, x:x+w].mean(axis=(0, 1))
1370
+
1371
+ # Calculate RGB color difference magnitude
1372
+ color_diff = np.linalg.norm(color1 - color2)
1373
+
1374
+ # Flag moderate color differences
1375
+ if color_diff > 18: # Balanced threshold
1376
+ # Skip if a previously recorded difference already covers this area (within a 21-pixel tolerance)
1377
+ already_covered = False
1378
+ for existing_diff in color_differences:
1379
+ if (abs(existing_diff['x'] - x) < 21 and
1380
+ abs(existing_diff['y'] - y) < 21 and
1381
+ abs(existing_diff['width'] - w) < 21 and
1382
+ abs(existing_diff['height'] - h) < 21):
1383
+ already_covered = True
1384
+ break
1385
+
1386
+ if not already_covered:
1387
+ color_differences.append({
1388
+ 'x': x,
1389
+ 'y': y,
1390
+ 'width': w,
1391
+ 'height': h,
1392
+ 'area': area,
1393
+ 'color1': color1.tolist(),
1394
+ 'color2': color2.tolist(),
1395
+ 'threshold': f"RGB_{threshold}",
1396
+ 'color_diff': color_diff,
1397
+ 'diff_r': float(abs(color1[0] - color2[0])),
1398
+ 'diff_g': float(abs(color1[1] - color2[1])),
1399
+ 'diff_b': float(abs(color1[2] - color2[2]))
1400
+ })
1401
+
1402
+ # Method 2: HSV color space comparison (hue and saturation weighted higher than value)
1403
+ print("Method 2: Enhanced HSV color space comparison")
1404
+
1405
+ # Convert to HSV for better color difference detection
1406
+ img1_hsv = cv2.cvtColor(img1_rgb, cv2.COLOR_RGB2HSV)
1407
+ img2_hsv = cv2.cvtColor(img2_rgb, cv2.COLOR_RGB2HSV)
1408
+
1409
+ # Enhanced HSV comparison with better channel weighting
1410
+ hue_diff = cv2.absdiff(img1_hsv[:,:,0], img2_hsv[:,:,0]) # Hue channel
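+ # Note: absdiff does not handle hue wrap-around (OpenCV 8-bit hue spans 0-179), so near-red hues may report inflated differences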
1411
+ sat_diff = cv2.absdiff(img1_hsv[:,:,1], img2_hsv[:,:,1]) # Saturation channel
1412
+ val_diff = cv2.absdiff(img1_hsv[:,:,2], img2_hsv[:,:,2]) # Value channel
1413
+
1414
+ # Enhanced HSV combination with better weighting
1415
+ hsv_combined = cv2.addWeighted(hue_diff, 0.5, sat_diff, 0.3, 0) # Hue and Saturation
1416
+ hsv_combined = cv2.addWeighted(hsv_combined, 1.0, val_diff, 0.2, 0) # Add Value channel
1417
+
1418
+ # Apply Gaussian blur to reduce noise and improve accuracy
1419
+ hsv_combined = cv2.GaussianBlur(hsv_combined, (3, 3), 0)
1420
+
1421
+ # Apply balanced HSV thresholds to catch color variations while avoiding multiple boxes
1422
+ hsv_thresholds = [18, 25, 35, 45] # Balanced HSV thresholds
1423
+
1424
+ for threshold in hsv_thresholds:
1425
+ _, hsv_thresh = cv2.threshold(hsv_combined, threshold, 255, cv2.THRESH_BINARY)
1426
+
1427
+ # Apply minimal morphological operations
1428
+ kernel = np.ones((1, 1), np.uint8)
1429
+ hsv_thresh = cv2.morphologyEx(hsv_thresh, cv2.MORPH_CLOSE, kernel)
1430
+ hsv_thresh = cv2.morphologyEx(hsv_thresh, cv2.MORPH_OPEN, kernel)
1431
+
1432
+ # Find contours
1433
+ hsv_contours, _ = cv2.findContours(hsv_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
1434
+
1435
+ print(f"HSV Threshold {threshold}: Found {len(hsv_contours)} contours")
1436
+
1437
+ for contour in hsv_contours:
1438
+ area = cv2.contourArea(contour)
1439
+ if area > 15: # Balanced area threshold to catch variations while avoiding small boxes
1440
+ x, y, w, h = cv2.boundingRect(contour)
1441
+
1442
+ # Get the actual colors at this location
1443
+ color1 = img1_rgb[y:y+h, x:x+w].mean(axis=(0, 1))
1444
+ color2 = img2_rgb[y:y+h, x:x+w].mean(axis=(0, 1))
1445
+
1446
+ # Calculate color difference magnitude
1447
+ color_diff = np.linalg.norm(color1 - color2)
1448
+
1449
+ # Flag moderate color differences
1450
+ if color_diff > 22: # Balanced threshold
1451
+ # Skip if a previously recorded difference already covers this area (within a 21-pixel tolerance)
1452
+ already_covered = False
1453
+ for existing_diff in color_differences:
1454
+ if (abs(existing_diff['x'] - x) < 21 and
1455
+ abs(existing_diff['y'] - y) < 21 and
1456
+ abs(existing_diff['width'] - w) < 21 and
1457
+ abs(existing_diff['height'] - h) < 21):
1458
+ already_covered = True
1459
+ break
1460
+
1461
+ if not already_covered:
1462
+ color_differences.append({
1463
+ 'x': x,
1464
+ 'y': y,
1465
+ 'width': w,
1466
+ 'height': h,
1467
+ 'area': area,
1468
+ 'color1': color1.tolist(),
1469
+ 'color2': color2.tolist(),
1470
+ 'threshold': f"HSV_{threshold}",
1471
+ 'color_diff': color_diff,
1472
+ 'diff_r': float(abs(color1[0] - color2[0])),
1473
+ 'diff_g': float(abs(color1[1] - color2[1])),
1474
+ 'diff_b': float(abs(color1[2] - color2[2]))
1475
+ })
1476
+
1477
+ # Method 3: Sparse pixel-by-pixel RGB comparison
1478
+ print("Method 3: Enhanced pixel-by-pixel RGB comparison")
1479
+
1480
+ # Sample every 12th pixel to keep the scan fast and avoid over-sensitivity
1481
+ for y in range(0, height, 12):
1482
+ for x in range(0, width, 12):
1483
+ color1 = img1_rgb[y, x]
1484
+ color2 = img2_rgb[y, x]
1485
+
1486
+ # Calculate absolute difference for each RGB channel
1487
+ diff_r = abs(int(color1[0]) - int(color2[0])) # Red channel
1488
+ diff_g = abs(int(color1[1]) - int(color2[1])) # Green channel
1489
+ diff_b = abs(int(color1[2]) - int(color2[2])) # Blue channel
1490
+
1491
+ # Flag if RGB channels differ by moderate amounts
1492
+ if diff_r > 10 or diff_g > 10 or diff_b > 10:
1493
+ # Skip if a previously recorded difference already covers this area (within a 21-pixel tolerance)
1494
+ already_covered = False
1495
+ for existing_diff in color_differences:
1496
+ if (abs(existing_diff['x'] - x) < 21 and
1497
+ abs(existing_diff['y'] - y) < 21):
1498
+ already_covered = True
1499
+ break
1500
+
1501
+ if not already_covered:
1502
+ color_differences.append({
1503
+ 'x': x,
1504
+ 'y': y,
1505
+ 'width': 5, # Small box around the pixel
1506
+ 'height': 5,
1507
+ 'area': 25,
1508
+ 'color1': color1.tolist(),
1509
+ 'color2': color2.tolist(),
1510
+ 'threshold': 'pixel_RGB',
1511
+ 'color_diff': diff_r + diff_g + diff_b,
1512
+ 'diff_r': diff_r,
1513
+ 'diff_g': diff_g,
1514
+ 'diff_b': diff_b
1515
+ })
1516
+
1517
+ print(f"RGB color comparison completed. Found {len(color_differences)} total differences.")
1518
+
1519
+ # Method 4: LAB color space comparison for perceptually uniform differences
1520
+ print("Method 4: LAB color space comparison")
1521
+
1522
+ # Convert to LAB color space for perceptual color differences
1523
+ img1_lab = cv2.cvtColor(img1_rgb, cv2.COLOR_RGB2LAB)
1524
+ img2_lab = cv2.cvtColor(img2_rgb, cv2.COLOR_RGB2LAB)
1525
+
1526
+ # Calculate LAB differences (perceptually uniform)
1527
+ lab_diff_l = cv2.absdiff(img1_lab[:,:,0], img2_lab[:,:,0]) # L channel (lightness)
1528
+ lab_diff_a = cv2.absdiff(img1_lab[:,:,1], img2_lab[:,:,1]) # a channel (green-red)
1529
+ lab_diff_b = cv2.absdiff(img1_lab[:,:,2], img2_lab[:,:,2]) # b channel (blue-yellow)
1530
+
1531
+ # Combine LAB differences with perceptual weighting
1532
+ lab_combined = cv2.addWeighted(lab_diff_l, 0.3, lab_diff_a, 0.35, 0) # L and a channels
1533
+ lab_combined = cv2.addWeighted(lab_combined, 1.0, lab_diff_b, 0.35, 0) # Add b channel
1534
+
1535
+ # Apply Gaussian blur for noise reduction
1536
+ lab_combined = cv2.GaussianBlur(lab_combined, (3, 3), 0)
1537
+
1538
+ # Apply balanced LAB thresholds to catch color variations while avoiding multiple boxes
1539
+ lab_thresholds = [20, 28, 38, 50] # Balanced LAB thresholds
1540
+
1541
+ for threshold in lab_thresholds:
1542
+ _, lab_thresh = cv2.threshold(lab_combined, threshold, 255, cv2.THRESH_BINARY)
1543
+
1544
+ # Apply morphological operations
1545
+ kernel = np.ones((1, 1), np.uint8)
1546
+ lab_thresh = cv2.morphologyEx(lab_thresh, cv2.MORPH_CLOSE, kernel)
1547
+ lab_thresh = cv2.morphologyEx(lab_thresh, cv2.MORPH_OPEN, kernel)
1548
+
1549
+ # Find contours
1550
+ lab_contours, _ = cv2.findContours(lab_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
1551
+
1552
+ print(f"LAB Threshold {threshold}: Found {len(lab_contours)} contours")
1553
+
1554
+ for contour in lab_contours:
1555
+ area = cv2.contourArea(contour)
1556
+ if area > 15: # Balanced area threshold to catch variations while avoiding small boxes
1557
+ x, y, w, h = cv2.boundingRect(contour)
1558
+
1559
+ # Get the actual colors at this location
1560
+ color1 = img1_rgb[y:y+h, x:x+w].mean(axis=(0, 1))
1561
+ color2 = img2_rgb[y:y+h, x:x+w].mean(axis=(0, 1))
1562
+
1563
+ # Calculate color difference magnitude
1564
+ color_diff = np.linalg.norm(color1 - color2)
1565
+
1566
+ # Flag moderate color differences
1567
+ if color_diff > 22: # Balanced threshold
1568
+ # Skip if a previously recorded difference already covers this area (within a 21-pixel tolerance)
1569
+ already_covered = False
1570
+ for existing_diff in color_differences:
1571
+ if (abs(existing_diff['x'] - x) < 21 and
1572
+ abs(existing_diff['y'] - y) < 21 and
1573
+ abs(existing_diff['width'] - w) < 21 and
1574
+ abs(existing_diff['height'] - h) < 21):
1575
+ already_covered = True
1576
+ break
1577
+
1578
+ if not already_covered:
1579
+ color_differences.append({
1580
+ 'x': x,
1581
+ 'y': y,
1582
+ 'width': w,
1583
+ 'height': h,
1584
+ 'area': area,
1585
+ 'color1': color1.tolist(),
1586
+ 'color2': color2.tolist(),
1587
+ 'threshold': f"LAB_{threshold}",
1588
+ 'color_diff': color_diff,
1589
+ 'diff_r': float(abs(color1[0] - color2[0])),
1590
+ 'diff_g': float(abs(color1[1] - color2[1])),
1591
+ 'diff_b': float(abs(color1[2] - color2[2]))
1592
+ })
1593
+
1594
+ print(f"Enhanced color comparison completed. Found {len(color_differences)} total differences.")
1595
+
1596
+ # Group nearby differences into one perimeter box per issue area
1597
+ if color_differences:
1598
+ grouped_differences = self.group_nearby_differences(color_differences)
1599
+ print(f"Grouped into {len(grouped_differences)} perimeter boxes")
1600
+ return grouped_differences
1601
+
1602
+ return color_differences
1603
+
1604
+ except Exception as e:
1605
+ print(f"Error comparing colors: {str(e)}")
1606
+ return []
1607
+
1608
+ def group_nearby_differences(self, differences):
1609
+ """Group nearby differences into larger bounding boxes around affected areas"""
1610
+ if not differences:
1611
+ return []
1612
+
1613
+ # Sort differences by position for easier grouping
1614
+ sorted_diffs = sorted(differences, key=lambda x: (x['y'], x['x']))
1615
+
1616
+ grouped_areas = []
1617
+ current_group = []
1618
+
1619
+ for diff in sorted_diffs:
1620
+ if not current_group:
1621
+ current_group = [diff]
1622
+ else:
1623
+ # Check if this difference is close to the current group
1624
+ should_group = False
1625
+ for group_diff in current_group:
1626
+ # Calculate distance between centers
1627
+ center1_x = group_diff['x'] + group_diff['width'] // 2
1628
+ center1_y = group_diff['y'] + group_diff['height'] // 2
1629
+ center2_x = diff['x'] + diff['width'] // 2
1630
+ center2_y = diff['y'] + diff['height'] // 2
1631
+
1632
+ distance = ((center1_x - center2_x) ** 2 + (center1_y - center2_y) ** 2) ** 0.5
1633
+
1634
+ # If distance is less than 200 pixels, group them for one box per main issue
1635
+ if distance < 200:
1636
+ should_group = True
1637
+ break
1638
+
1639
+ if should_group:
1640
+ current_group.append(diff)
1641
+ else:
1642
+ # Create bounding box for current group
1643
+ if current_group:
1644
+ bounding_box = self.create_group_bounding_box(current_group)
1645
+ if bounding_box: # Only add if not None
1646
+ grouped_areas.append(bounding_box)
1647
+ current_group = [diff]
1648
+
1649
+ # Don't forget the last group
1650
+ if current_group:
1651
+ bounding_box = self.create_group_bounding_box(current_group)
1652
+ if bounding_box: # Only add if not None
1653
+ grouped_areas.append(bounding_box)
1654
+
1655
+ return grouped_areas
1656
+
1657
+ def group_nearby_differences(self, differences):
1658
+ """Group nearby differences into one perimeter box per issue area"""
1659
+ if not differences:
1660
+ return []
1661
+
1662
+ # Sort differences by position for easier grouping
1663
+ sorted_diffs = sorted(differences, key=lambda x: (x['y'], x['x']))
1664
+
1665
+ grouped_areas = []
1666
+ current_group = []
1667
+
1668
+ for diff in sorted_diffs:
1669
+ if not current_group:
1670
+ current_group = [diff]
1671
+ else:
1672
+ # Check if this difference is close to the current group
1673
+ should_group = False
1674
+ for group_diff in current_group:
1675
+ # Calculate distance between centers
1676
+ center1_x = group_diff['x'] + group_diff['width'] // 2
1677
+ center1_y = group_diff['y'] + group_diff['height'] // 2
1678
+ center2_x = diff['x'] + diff['width'] // 2
1679
+ center2_y = diff['y'] + diff['height'] // 2
1680
+
1681
+ distance = ((center1_x - center2_x) ** 2 + (center1_y - center2_y) ** 2) ** 0.5
1682
+
1683
+ # If the centers are less than 234 pixels apart, merge the differences into one consolidated problem area
1684
+ if distance < 234:
1685
+ should_group = True
1686
+ break
1687
+
1688
+ if should_group:
1689
+ current_group.append(diff)
1690
+ else:
1691
+ # Create perimeter box for current group
1692
+ if current_group:
1693
+ perimeter_box = self.create_perimeter_box(current_group)
1694
+ if perimeter_box: # Only add if not None
1695
+ grouped_areas.append(perimeter_box)
1696
+ current_group = [diff]
1697
+
1698
+ # Don't forget the last group
1699
+ if current_group:
1700
+ perimeter_box = self.create_perimeter_box(current_group)
1701
+ if perimeter_box: # Only add if not None
1702
+ grouped_areas.append(perimeter_box)
1703
+
1704
+ return grouped_areas
1705
+
1706
+ def create_perimeter_box(self, group):
1707
+ """Create a perimeter box that encompasses all differences in a group"""
1708
+ if not group:
1709
+ return None
1710
+
1711
+ # Find the overall bounding box
1712
+ min_x = min(diff['x'] - 5 for diff in group) # Include 5-pixel extension
1713
+ min_y = min(diff['y'] - 5 for diff in group) # Include 5-pixel extension
1714
+ max_x = max(diff['x'] + diff['width'] + 5 for diff in group) # Include 5-pixel extension
1715
+ max_y = max(diff['y'] + diff['height'] + 5 for diff in group) # Include 5-pixel extension
1716
+
1717
+ # Add minimal padding around the perimeter box
1718
+ padding = 7
1719
+ min_x = max(0, min_x - padding)
1720
+ min_y = max(0, min_y - padding)
1721
+ max_x = max_x + padding
1722
+ max_y = max_y + padding
1723
+
1724
+ # Calculate final dimensions
1725
+ width = max_x - min_x
1726
+ height = max_y - min_y
1727
+
1728
+ # Filter out groups smaller than 26 pixels in either dimension
1729
+ if width < 26 or height < 26:
1730
+ return None
1731
+
1732
+ return {
1733
+ 'x': min_x,
1734
+ 'y': min_y,
1735
+ 'width': width,
1736
+ 'height': height,
1737
+ 'area': width * height,
1738
+ 'color1': [0, 0, 0], # Placeholder
1739
+ 'color2': [0, 0, 0], # Placeholder
1740
+ 'threshold': 'perimeter',
1741
+ 'color_diff': 1.0,
1742
+ 'num_original_differences': len(group)
1743
+ }
1744
+
1745
+ def create_annotated_image(self, image, differences, output_path):
1746
+ """Create annotated image with red boxes around differences"""
1747
+ try:
1748
+ print(f"Creating annotated image: {output_path}")
1749
+ print(f"Number of differences to annotate: {len(differences)}")
1750
+
1751
+ # Create a copy of the image
1752
+ annotated_image = image.copy()
1753
+ draw = ImageDraw.Draw(annotated_image)
1754
+
1755
+ # Draw red rectangles around differences
1756
+ for i, diff in enumerate(differences):
1757
+ x, y, w, h = diff['x'], diff['y'], diff['width'], diff['height']
1758
+
1759
+ # Draw thicker red rectangle
1760
+ draw.rectangle([x, y, x + w, y + h], outline='red', width=5)
1761
+
1762
+ print(f"Drawing rectangle {i+1}: ({x}, {y}) to ({x+w}, {y+h})")
1763
+
1764
+ # Save annotated image
1765
+ annotated_image.save(output_path)
1766
+ print(f"Annotated image saved successfully: {output_path}")
1767
+
1768
+ except Exception as e:
1769
+ print(f"Error creating annotated image: {str(e)}")
1770
+ # Try to save the original image as fallback
1771
+ try:
1772
+ image.save(output_path)
1773
+ print(f"Saved original image as fallback: {output_path}")
1774
+ except Exception as e2:
1775
+ print(f"Failed to save fallback image: {str(e2)}")
1776
+
1777
+ def compare_pdfs(self, pdf1_path, pdf2_path, session_id):
1778
+ """Main comparison function with improved error handling"""
1779
+ try:
1780
+ print("Starting PDF comparison...")
1781
+ start_time = time.time()
1782
+
1783
+ # Validate both PDFs contain "50 Carroll"
1784
+ print("Validating PDF 1...")
1785
+ if not self.validate_pdf(pdf1_path):
1786
+ raise Exception("INVALID DOCUMENT")
1787
+
1788
+ print("Validating PDF 2...")
1789
+ if not self.validate_pdf(pdf2_path):
1790
+ raise Exception("INVALID DOCUMENT")
1791
+
1792
+ # Extract text and images from both PDFs
1793
+ print("Extracting text from PDF 1...")
1794
+ pdf1_data = self.extract_text_from_pdf(pdf1_path)
1795
+ if not pdf1_data:
1796
+ raise Exception("INVALID DOCUMENT")
1797
+
1798
+ print("Extracting text from PDF 2...")
1799
+ pdf2_data = self.extract_text_from_pdf(pdf2_path)
1800
+ if not pdf2_data:
1801
+ raise Exception("INVALID DOCUMENT")
1802
+
1803
+ # Initialize results
1804
+ results = {
1805
+ 'session_id': session_id,
1806
+ 'validation': {
1807
+ 'pdf1_valid': True,
1808
+ 'pdf2_valid': True,
1809
+ 'validation_text': '50 Carroll'
1810
+ },
1811
+ 'text_comparison': [],
1812
+ 'spelling_issues': [],
1813
+ 'barcodes_qr_codes': [],
1814
+ 'color_differences': [],
1815
+ 'annotated_images': []
1816
+ }
1817
+
1818
+ # Compare text and check spelling
1819
+ print("Processing pages...")
1820
+ for i, (page1, page2) in enumerate(zip(pdf1_data, pdf2_data)):
1821
+ print(f"Processing page {i + 1}...")
1822
+ page_results = {
1823
+ 'page': i + 1,
1824
+ 'text_differences': [],
1825
+ 'spelling_issues_pdf1': [],
1826
+ 'spelling_issues_pdf2': [],
1827
+ 'barcodes_pdf1': [],
1828
+ 'barcodes_pdf2': [],
1829
+ 'color_differences': []
1830
+ }
1831
+
1832
+ # Check spelling for both PDFs
1833
+ print(f"Checking spelling for page {i + 1}...")
1834
+ page_results['spelling_issues_pdf1'] = self.check_spelling(page1['text'])
1835
+ page_results['spelling_issues_pdf2'] = self.check_spelling(page2['text'])
1836
+
1837
+ # Add spelling issues to text differences for UI visibility
1838
+ if page_results['spelling_issues_pdf1'] or page_results['spelling_issues_pdf2']:
1839
+ page_results['text_differences'].append({
1840
+ "type": "spelling",
1841
+ "pdf1": [i["word"] for i in page_results['spelling_issues_pdf1']],
1842
+ "pdf2": [i["word"] for i in page_results['spelling_issues_pdf2']],
1843
+ })
1844
+
1845
+ # Create spelling-only annotated images (one box per error)
1846
+ spell_dir = f'static/results/{session_id}'
1847
+ os.makedirs(spell_dir, exist_ok=True)
1848
+
1849
+ spell_img1 = page1['image'].copy()
1850
+ spell_img2 = page2['image'].copy()
1851
+ spell_img1 = self.annotate_spelling_errors_on_image(spell_img1, page_results['spelling_issues_pdf1'])
1852
+ spell_img2 = self.annotate_spelling_errors_on_image(spell_img2, page_results['spelling_issues_pdf2'])
1853
+
1854
+ spell_path1 = f'{spell_dir}/page_{i+1}_pdf1_spelling.png'
1855
+ spell_path2 = f'{spell_dir}/page_{i+1}_pdf2_spelling.png'
1856
+ spell_img1.save(spell_path1)
1857
+ spell_img2.save(spell_path2)
1858
+
1859
+ # link them into the results for your UI
1860
+ page_results.setdefault('annotated_images', {})
1861
+ page_results['annotated_images'].update({
1862
+ 'pdf1_spelling': f'results/{session_id}/page_{i+1}_pdf1_spelling.png',
1863
+ 'pdf2_spelling': f'results/{session_id}/page_{i+1}_pdf2_spelling.png',
1864
+ })
1865
+
1866
+ # Detect barcodes and QR codes
1867
+ print(f"Detecting barcodes for page {i + 1} PDF 1...")
1868
+ page_results['barcodes_pdf1'] = self.detect_barcodes_qr_codes(page1['image']) or []
1869
+
1870
+ print(f"Detecting barcodes for page {i + 1} PDF 2...")
1871
+ page_results['barcodes_pdf2'] = self.detect_barcodes_qr_codes(page2['image']) or []
1872
+
1873
+ # Compare colors
1874
+ print(f"Comparing colors for page {i + 1}...")
1875
+ color_diffs = self.compare_colors(page1['image'], page2['image'])
1876
+ page_results['color_differences'] = color_diffs
1877
+
1878
+ # Create annotated images and save original images
1879
+ print(f"Creating images for page {i + 1}...")
1880
+ output_dir = f'static/results/{session_id}'
1881
+ os.makedirs(output_dir, exist_ok=True)
1882
+
1883
+ # Save original images
1884
+ original_path1 = f'{output_dir}/page_{i+1}_pdf1_original.png'
1885
+ original_path2 = f'{output_dir}/page_{i+1}_pdf2_original.png'
1886
+
1887
+ page1['image'].save(original_path1)
1888
+ page2['image'].save(original_path2)
1889
+
1890
+ # Create annotated images if there are color differences
1891
+ if color_diffs:
1892
+ print(f"Creating annotated images for page {i + 1}...")
1893
+ annotated_path1 = f'{output_dir}/page_{i+1}_pdf1_annotated.png'
1894
+ annotated_path2 = f'{output_dir}/page_{i+1}_pdf2_annotated.png'
1895
+
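+ # color_diffs coordinates were computed on images resized to the smaller common dimensions, so the boxes assume both pages render at roughly the same size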
1896
+ self.create_annotated_image(page1['image'], color_diffs, annotated_path1)
1897
+ self.create_annotated_image(page2['image'], color_diffs, annotated_path2)
1898
+
1899
+ page_results['annotated_images'] = {
1900
+ 'pdf1': f'results/{session_id}/page_{i+1}_pdf1_annotated.png',
1901
+ 'pdf2': f'results/{session_id}/page_{i+1}_pdf2_annotated.png'
1902
+ }
1903
+ else:
1904
+ # If no color differences, use original images
1905
+ page_results['annotated_images'] = {
1906
+ 'pdf1': f'results/{session_id}/page_{i+1}_pdf1_original.png',
1907
+ 'pdf2': f'results/{session_id}/page_{i+1}_pdf2_original.png'
1908
+ }
1909
+
1910
+ results['text_comparison'].append(page_results)
1911
+
1912
+ # Aggregate spelling issues
1913
+ print("Aggregating results...")
1914
+ all_spelling_issues = []
1915
+ for page in results['text_comparison']:
1916
+ all_spelling_issues.extend(page['spelling_issues_pdf1'])
1917
+ all_spelling_issues.extend(page['spelling_issues_pdf2'])
1918
+
1919
+ results['spelling_issues'] = all_spelling_issues
1920
+
1921
+ # Aggregate barcodes and QR codes
1922
+ all_barcodes = []
1923
+ for page in results['text_comparison']:
1924
+ all_barcodes.extend(page['barcodes_pdf1'])
1925
+ all_barcodes.extend(page['barcodes_pdf2'])
1926
+
1927
+ results['barcodes_qr_codes'] = all_barcodes
1928
+
1929
+ elapsed_time = time.time() - start_time
1930
+ print(f"PDF comparison completed in {elapsed_time:.2f} seconds.")
1931
+
1932
+ return results
1933
+
1934
+ except Exception as e:
1935
+ print(f"Error in PDF comparison: {str(e)}")
1936
+ raise Exception(f"INVALID DOCUMENT")
1937
+ # Enhanced OCR for tiny fonts - deployment check
1938
+ # Force rebuild - Thu Sep 4 09:33:44 EDT 2025
ProofCheck/requirements.txt ADDED
@@ -0,0 +1,20 @@
1
+ Flask==2.3.3
2
+ Werkzeug==2.3.7
3
+ PyPDF2==3.0.1
4
+ pdf2image==1.16.3
5
+ Pillow==10.0.1
6
+ opencv-python==4.8.1.78
7
+ pytesseract==0.3.10
8
+ pyzbar==0.1.9
9
+ pyspellchecker==0.7.2
10
+ nltk==3.8.1
11
+ numpy==1.24.3
12
+ scikit-image==0.21.0
13
+ matplotlib==3.7.2
14
+ pandas==2.0.3
15
+ reportlab==4.0.4
16
+ python-barcode==0.15.1
17
+ zxing-cpp==2.0.0
18
+ dbr==9.6.30
19
+ PyMuPDF==1.23.8
20
+ regex==2023.10.3
ProofCheck/run.py ADDED
@@ -0,0 +1,123 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Startup script for PDF Comparison Tool
4
+ """
5
+
6
+ import os
7
+ import sys
8
+ import subprocess
9
+ import webbrowser
10
+ import time
11
+ from pathlib import Path
12
+
13
+ def check_python_version():
14
+ """Check if Python version is compatible"""
15
+ if sys.version_info < (3, 7):
16
+ print("❌ Python 3.7 or higher is required")
17
+ print(f"Current version: {sys.version}")
18
+ return False
19
+ print(f"βœ… Python {sys.version.split()[0]} is compatible")
20
+ return True
21
+
22
+ def check_dependencies():
23
+ """Check if required dependencies are installed"""
24
+ try:
25
+ import flask
26
+ import cv2
27
+ import numpy
28
+ import PIL
29
+ import pytesseract
30
+ import pdf2image
31
+ import pyzbar
32
+ import spellchecker
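+ # the pyspellchecker package is imported as "spellchecker"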
33
+ import nltk
34
+ import skimage
35
+ print("βœ… All Python dependencies are installed")
36
+ return True
37
+ except ImportError as e:
38
+ print(f"❌ Missing dependency: {e}")
39
+ print("Please run: pip install -r requirements.txt")
40
+ return False
41
+
42
+ def check_tesseract():
43
+ """Check if Tesseract OCR is installed"""
44
+ try:
45
+ import pytesseract
46
+ pytesseract.get_tesseract_version()
47
+ print("βœ… Tesseract OCR is available")
48
+ return True
49
+ except Exception as e:
50
+ print(f"❌ Tesseract OCR not found: {e}")
51
+ print("Please install Tesseract:")
52
+ print(" macOS: brew install tesseract")
53
+ print(" Ubuntu: sudo apt-get install tesseract-ocr")
54
+ print(" Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki")
55
+ return False
56
+
57
+ def create_directories():
58
+ """Create necessary directories"""
59
+ directories = ['uploads', 'results', 'static/results']
60
+ for directory in directories:
61
+ Path(directory).mkdir(parents=True, exist_ok=True)
62
+ print("βœ… Directories created")
63
+
64
+ def start_application():
65
+ """Start the Flask application"""
66
+ print("\nπŸš€ Starting PDF Comparison Tool...")
67
+ print("πŸ“± The application will be available at: http://localhost:5000")
68
+ print("⏹️ Press Ctrl+C to stop the application")
69
+ print("-" * 50)
70
+
71
+ try:
72
+ # Start the Flask app
73
+ from app import app
74
+ app.run(debug=True, host='0.0.0.0', port=5000)
75
+ except KeyboardInterrupt:
76
+ print("\nπŸ‘‹ Application stopped by user")
77
+ except Exception as e:
78
+ print(f"❌ Error starting application: {e}")
79
+ return False
80
+
81
+ return True
82
+
83
+ def main():
84
+ """Main startup function"""
85
+ print("=" * 50)
86
+ print("πŸ“„ PDF Comparison Tool")
87
+ print("=" * 50)
88
+
89
+ # Check requirements
90
+ if not check_python_version():
91
+ sys.exit(1)
92
+
93
+ if not check_dependencies():
94
+ sys.exit(1)
95
+
96
+ if not check_tesseract():
97
+ sys.exit(1)
98
+
99
+ # Create directories
100
+ create_directories()
101
+
102
+ # Ask user if they want to open browser
103
+ try:
104
+ response = input("\n🌐 Open browser automatically? (y/n): ").lower().strip()
105
+ if response in ['y', 'yes']:
106
+ # Wait a moment for the server to start
107
+ def open_browser():
108
+ time.sleep(2)
109
+ webbrowser.open('http://localhost:5000')
110
+
111
+ import threading
112
+ browser_thread = threading.Thread(target=open_browser)
113
+ browser_thread.daemon = True
114
+ browser_thread.start()
115
+ except KeyboardInterrupt:
116
+ print("\nπŸ‘‹ Setup cancelled by user")
117
+ sys.exit(0)
118
+
119
+ # Start the application
120
+ start_application()
121
+
122
+ if __name__ == "__main__":
123
+ main()
ProofCheck/static/css/style.css ADDED
@@ -0,0 +1,324 @@
1
+ /* Custom styles for PDF Comparison Tool */
2
+
3
+ body {
4
+ background-color: hsl(202, 68%, 79%);
5
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
6
+ }
7
+
8
+ .navbar-brand {
9
+ font-weight: 600;
10
+ font-size: 1.5rem;
11
+ }
12
+
13
+ .card {
14
+ border: none;
15
+ border-radius: 12px;
16
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
17
+ transition: transform 0.2s ease-in-out;
18
+ }
19
+
20
+ .card:hover {
21
+ transform: translateY(-2px);
22
+ }
23
+
24
+ .card-header {
25
+ border-radius: 12px 12px 0 0 !important;
26
+ border-bottom: none;
27
+ font-weight: 600;
28
+ }
29
+
30
+ .btn-primary {
31
+ background: linear-gradient(135deg, #007bff, #0056b3);
32
+ border: none;
33
+ border-radius: 8px;
34
+ font-weight: 600;
35
+ padding: 12px 24px;
36
+ transition: all 0.3s ease;
37
+ }
38
+
39
+ .btn-primary:hover {
40
+ background: linear-gradient(135deg, #0056b3, #004085);
41
+ transform: translateY(-1px);
42
+ box-shadow: 0 4px 8px rgba(0, 123, 255, 0.3);
43
+ }
44
+
45
+ .form-control {
46
+ border-radius: 8px;
47
+ border: 2px solid #e9ecef;
48
+ padding: 12px 16px;
49
+ transition: border-color 0.3s ease;
50
+ }
51
+
52
+ .form-control:focus {
53
+ border-color: #007bff;
54
+ box-shadow: 0 0 0 0.2rem rgba(0, 123, 255, 0.25);
55
+ }
56
+
57
+ /* Drag and Drop Styles */
58
+ .drag-drop-zone {
59
+ position: relative;
60
+ border: 3px dashed #dee2e6;
61
+ border-radius: 12px;
62
+ padding: 40px 20px;
63
+ text-align: center;
64
+ background-color: #f8f9fa;
65
+ transition: all 0.3s ease;
66
+ cursor: pointer;
67
+ min-height: 200px;
68
+ display: flex;
69
+ align-items: center;
70
+ justify-content: center;
71
+ }
72
+
73
+ .drag-drop-zone:hover {
74
+ border-color: #007bff;
75
+ background-color: #f0f8ff;
76
+ }
77
+
78
+ .drag-drop-zone.drag-over {
79
+ border-color: #28a745;
80
+ background-color: #f0fff0;
81
+ transform: scale(1.02);
82
+ }
83
+
84
+ .drag-drop-zone.has-file {
85
+ border-color: #28a745;
86
+ background-color: #f0fff0;
87
+ }
88
+
89
+ .drag-drop-content {
90
+ pointer-events: none;
91
+ z-index: 1;
92
+ }
93
+
94
+ .drag-drop-text {
95
+ font-size: 1.1rem;
96
+ font-weight: 600;
97
+ color: #495057;
98
+ margin-bottom: 8px;
99
+ }
100
+
101
+ .drag-drop-hint {
102
+ font-size: 0.9rem;
103
+ color: #6c757d;
104
+ margin-bottom: 0;
105
+ }
106
+
107
+ .drag-drop-input {
108
+ position: absolute;
109
+ top: 0;
110
+ left: 0;
111
+ width: 100%;
112
+ height: 100%;
113
+ opacity: 0;
114
+ cursor: pointer;
115
+ z-index: 2;
116
+ }
117
+
118
+ .drag-drop-zone .file-info {
119
+ display: none;
120
+ margin-top: 15px;
121
+ }
122
+
123
+ .drag-drop-zone.has-file .file-info {
124
+ display: block;
125
+ }
126
+
127
+ .drag-drop-zone.has-file .drag-drop-content {
128
+ display: none;
129
+ }
130
+
131
+ .file-info {
132
+ background: rgba(40, 167, 69, 0.1);
133
+ border: 1px solid #28a745;
134
+ border-radius: 8px;
135
+ padding: 10px;
136
+ margin-top: 10px;
137
+ }
138
+
139
+ .file-info i {
140
+ color: #28a745;
141
+ margin-right: 8px;
142
+ }
143
+
144
+ .nav-tabs .nav-link {
145
+ border: none;
146
+ border-radius: 8px 8px 0 0;
147
+ color: #6c757d;
148
+ font-weight: 500;
149
+ padding: 12px 20px;
150
+ transition: all 0.3s ease;
151
+ }
152
+
153
+ .nav-tabs .nav-link:hover {
154
+ color: #007bff;
155
+ background-color: #f8f9fa;
156
+ }
157
+
158
+ .nav-tabs .nav-link.active {
159
+ background-color: #007bff;
160
+ color: white;
161
+ border: none;
162
+ }
163
+
164
+ .alert {
165
+ border-radius: 8px;
166
+ border: none;
167
+ font-weight: 500;
168
+ }
169
+
170
+ .spinner-border {
171
+ width: 3rem;
172
+ height: 3rem;
173
+ }
174
+
175
+ .progress {
176
+ height: 8px;
177
+ border-radius: 4px;
178
+ }
179
+
180
+ .progress-bar {
181
+ border-radius: 4px;
182
+ }
183
+
184
+ /* Comparison results styling */
185
+ .comparison-image {
186
+ max-width: 100%;
187
+ height: auto;
188
+ border-radius: 8px;
189
+ box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
190
+ margin: 10px 0;
191
+ }
192
+
193
+ .difference-box {
194
+ border: 3px solid #dc3545;
195
+ border-radius: 4px;
196
+ position: relative;
197
+ }
198
+
199
+ .difference-box::after {
200
+ content: "Difference";
201
+ position: absolute;
202
+ top: -10px;
203
+ left: 10px;
204
+ background: #dc3545;
205
+ color: white;
206
+ padding: 2px 8px;
207
+ border-radius: 4px;
208
+ font-size: 12px;
209
+ font-weight: bold;
210
+ }
211
+
212
+ /* Table styling */
213
+ .table {
214
+ border-radius: 8px;
215
+ overflow: hidden;
216
+ box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
217
+ }
218
+
219
+ .table thead th {
220
+ background-color: #f8f9fa;
221
+ border-bottom: 2px solid #dee2e6;
222
+ font-weight: 600;
223
+ color: #495057;
224
+ }
225
+
226
+ .table tbody tr:hover {
227
+ background-color: #f8f9fa;
228
+ }
229
+
230
+ /* Badge styling */
231
+ .badge {
232
+ font-size: 0.8em;
233
+ padding: 6px 10px;
234
+ border-radius: 6px;
235
+ }
236
+
237
+ .badge-danger {
238
+ background-color: #dc3545;
239
+ }
240
+
241
+ .badge-warning {
242
+ background-color: #ffc107;
243
+ color: #212529;
244
+ }
245
+
246
+ .badge-success {
247
+ background-color: #28a745;
248
+ }
249
+
250
+ .badge-info {
251
+ background-color: #17a2b8;
252
+ }
253
+
254
+ /* Responsive design */
255
+ @media (max-width: 768px) {
256
+ .container {
257
+ padding: 0 15px;
258
+ }
259
+
260
+ .card {
261
+ margin-bottom: 20px;
262
+ }
263
+
264
+ .nav-tabs .nav-link {
265
+ padding: 8px 12px;
266
+ font-size: 14px;
267
+ }
268
+
269
+ .btn-lg {
270
+ padding: 10px 20px;
271
+ font-size: 16px;
272
+ }
273
+
274
+ .drag-drop-zone {
275
+ min-height: 150px;
276
+ padding: 30px 15px;
277
+ }
278
+
279
+ .drag-drop-text {
280
+ font-size: 1rem;
281
+ }
282
+ }
283
+
284
+ /* Loading animation */
285
+ @keyframes pulse {
286
+ 0% { opacity: 1; }
287
+ 50% { opacity: 0.5; }
288
+ 100% { opacity: 1; }
289
+ }
290
+
291
+ .loading-pulse {
292
+ animation: pulse 1.5s infinite;
293
+ }
294
+
295
+ /* Custom scrollbar */
296
+ ::-webkit-scrollbar {
297
+ width: 8px;
298
+ }
299
+
300
+ ::-webkit-scrollbar-track {
301
+ background: #f1f1f1;
302
+ border-radius: 4px;
303
+ }
304
+
305
+ ::-webkit-scrollbar-thumb {
306
+ background: #c1c1c1;
307
+ border-radius: 4px;
308
+ }
309
+
310
+ ::-webkit-scrollbar-thumb:hover {
311
+ background: #a8a8a8;
312
+ }
313
+
314
+ /* Print styles */
315
+ @media print {
316
+ .navbar, .btn, .nav-tabs {
317
+ display: none !important;
318
+ }
319
+
320
+ .card {
321
+ box-shadow: none !important;
322
+ border: 1px solid #dee2e6 !important;
323
+ }
324
+ }
ProofCheck/static/js/script.js ADDED
@@ -0,0 +1,353 @@
1
+ // PDF Comparison Tool JavaScript
2
+
3
+ document.addEventListener('DOMContentLoaded', function() {
4
+ const uploadForm = document.getElementById('uploadForm');
5
+ const loadingSection = document.getElementById('loadingSection');
6
+ const resultsSection = document.getElementById('resultsSection');
7
+ const errorSection = document.getElementById('errorSection');
8
+ const errorMessage = document.getElementById('errorMessage');
9
+
10
+ // Initialize drag and drop zones
11
+ initializeDragAndDrop('dragZone1', 'pdf1');
12
+ initializeDragAndDrop('dragZone2', 'pdf2');
13
+
14
+ // Handle form submission
15
+ uploadForm.addEventListener('submit', function(e) {
16
+ e.preventDefault();
17
+
18
+ const formData = new FormData(uploadForm);
19
+ const pdf1 = document.getElementById('pdf1').files[0];
20
+ const pdf2 = document.getElementById('pdf2').files[0];
21
+
22
+ // Validate files
23
+ if (!pdf1 || !pdf2) {
24
+ showError('Please select both PDF files.');
25
+ return;
26
+ }
27
+
28
+ if (!pdf1.name.toLowerCase().endsWith('.pdf') || !pdf2.name.toLowerCase().endsWith('.pdf')) {
29
+ showError('Please select valid PDF files.');
30
+ return;
31
+ }
32
+
33
+ // Show loading
34
+ showLoading();
35
+ hideError();
36
+
37
+ // Submit form via AJAX
38
+ fetch('/upload', {
39
+ method: 'POST',
40
+ body: formData
41
+ })
42
+ .then(response => response.json())
43
+ .then(data => {
44
+ hideLoading();
45
+
46
+ if (data.success) {
47
+ displayResults(data.results);
48
+ } else {
49
+ showError(data.error || 'An error occurred during comparison.');
50
+ }
51
+ })
52
+ .catch(error => {
53
+ hideLoading();
54
+ showError('Network error: ' + error.message);
55
+ });
56
+ });
57
+
58
+ function initializeDragAndDrop(zoneId, inputId) {
59
+ const zone = document.getElementById(zoneId);
60
+ const input = document.getElementById(inputId);
61
+
62
+ if (!zone || !input) return;
63
+
64
+ // Create file info display
65
+ const fileInfo = document.createElement('div');
66
+ fileInfo.className = 'file-info';
67
+ fileInfo.innerHTML = '<i class="fas fa-file-pdf"></i><span class="file-name"></span>';
68
+ zone.appendChild(fileInfo);
69
+
70
+ // Drag and drop events
71
+ zone.addEventListener('dragover', function(e) {
72
+ e.preventDefault();
73
+ e.stopPropagation();
74
+ zone.classList.add('drag-over');
75
+ });
76
+
77
+ zone.addEventListener('dragleave', function(e) {
78
+ e.preventDefault();
79
+ e.stopPropagation();
80
+ zone.classList.remove('drag-over');
81
+ });
82
+
83
+ zone.addEventListener('drop', function(e) {
84
+ e.preventDefault();
85
+ e.stopPropagation();
86
+ zone.classList.remove('drag-over');
87
+
88
+ const files = e.dataTransfer.files;
89
+ if (files.length > 0) {
90
+ const file = files[0];
91
+ if (file.type === 'application/pdf' || file.name.toLowerCase().endsWith('.pdf')) {
92
+ handleFileSelect(file, input, zone);
93
+ } else {
94
+ showError('Please select a valid PDF file.');
95
+ }
96
+ }
97
+ });
98
+
99
+ // Click to browse
100
+ zone.addEventListener('click', function(e) {
101
+ if (e.target !== input) {
102
+ input.click();
103
+ }
104
+ });
105
+
106
+ // File input change
107
+ input.addEventListener('change', function(e) {
108
+ const file = e.target.files[0];
109
+ if (file) {
110
+ handleFileSelect(file, input, zone);
111
+ }
112
+ });
113
+ }
114
+
115
+ function handleFileSelect(file, input, zone) {
116
+ // Update the file input
117
+ const dataTransfer = new DataTransfer();
118
+ dataTransfer.items.add(file);
119
+ input.files = dataTransfer.files;
120
+
121
+ // Update visual feedback
122
+ zone.classList.add('has-file');
123
+ const fileName = zone.querySelector('.file-name');
124
+ if (fileName) {
125
+ fileName.textContent = file.name;
126
+ }
127
+
128
+ // Update form text
129
+ const formText = zone.querySelector('.drag-drop-hint');
130
+ if (formText) {
131
+ formText.textContent = `Selected: ${file.name}`;
132
+ }
133
+ }
134
+
135
+ function showLoading() {
136
+ loadingSection.style.display = 'block';
137
+ resultsSection.style.display = 'none';
138
+ errorSection.style.display = 'none';
139
+ }
140
+
141
+ function hideLoading() {
142
+ loadingSection.style.display = 'none';
143
+ }
144
+
145
+ function showError(message) {
146
+ errorMessage.textContent = message;
147
+ errorSection.style.display = 'block';
148
+ resultsSection.style.display = 'none';
149
+ }
150
+
151
+ function hideError() {
152
+ errorSection.style.display = 'none';
153
+ }
154
+
155
+ function displayResults(results) {
156
+ resultsSection.style.display = 'block';
157
+
158
+ // Display visual comparison
159
+ displayVisualComparison(results);
160
+
161
+ // Display spelling issues
162
+ displaySpellingIssues(results);
163
+
164
+ // Display barcodes and QR codes
165
+ displayBarcodes(results);
166
+ }
167
+
168
+ function displayVisualComparison(results) {
169
+ const visualContent = document.getElementById('visualComparisonContent');
170
+ let html = '<div class="row">';
171
+
172
+ if (results.text_comparison && results.text_comparison.length > 0) {
173
+ results.text_comparison.forEach((page, index) => {
174
+ html += `
175
+ <div class="col-12 mb-4">
176
+ <h6 class="text-primary mb-3">Page ${page.page}</h6>
177
+ <div class="row">
178
+ <div class="col-md-6">
179
+ <h6>PDF 1</h6>
180
+ ${page.annotated_images && page.annotated_images.pdf1 ?
181
+ `<img src="/static/${page.annotated_images.pdf1}" class="comparison-image" alt="PDF 1 Page ${page.page}">` :
182
+ '<p class="text-muted">No differences detected</p>'
183
+ }
184
+ </div>
185
+ <div class="col-md-6">
186
+ <h6>PDF 2</h6>
187
+ ${page.annotated_images && page.annotated_images.pdf2 ?
188
+ `<img src="/static/${page.annotated_images.pdf2}" class="comparison-image" alt="PDF 2 Page ${page.page}">` :
189
+ '<p class="text-muted">No differences detected</p>'
190
+ }
191
+ </div>
192
+ </div>
193
+ ${page.color_differences && page.color_differences.length > 0 ?
194
+ `<div class="mt-3">
195
+ <span class="badge badge-danger">${page.color_differences.length} color difference(s) detected</span>
196
+ </div>` :
197
+ '<div class="mt-3"><span class="badge badge-success">No color differences</span></div>'
198
+ }
199
+ </div>
200
+ `;
201
+ });
202
+ } else {
203
+ html += '<div class="col-12"><p class="text-muted">No visual comparison data available.</p></div>';
204
+ }
205
+
206
+ html += '</div>';
207
+ visualContent.innerHTML = html;
208
+ }
209
+
210
+ function displaySpellingIssues(results) {
211
+ const spellingContent = document.getElementById('spellingIssuesContent');
212
+ let html = '';
213
+
214
+ if (results.spelling_issues && results.spelling_issues.length > 0) {
215
+ html += `
216
+ <div class="table-responsive">
217
+ <table class="table table-striped">
218
+ <thead>
219
+ <tr>
220
+ <th>Word</th>
221
+ <th>Original</th>
222
+ <th>Misspelled In</th>
223
+ <th>English Suggestions</th>
224
+ <th>French Suggestions</th>
225
+ </tr>
226
+ </thead>
227
+ <tbody>
228
+ `;
229
+
230
+ results.spelling_issues.forEach(issue => {
231
+ const misspelledIn = issue.lang ? issue.lang.toUpperCase() : 'Unknown';
232
+ const englishSuggestions = issue.suggestions_en && issue.suggestions_en.length ? issue.suggestions_en.join(', ') : 'None';
233
+ const frenchSuggestions = issue.suggestions_fr && issue.suggestions_fr.length ? issue.suggestions_fr.join(', ') : 'None';
234
+
235
+ html += `
236
+ <tr>
237
+ <td><strong>${issue.word}</strong></td>
238
+ <td><code>${issue.original_word || issue.word}</code></td>
239
+ <td><span class="badge badge-warning">${misspelledIn}</span></td>
240
+ <td>${englishSuggestions}</td>
241
+ <td>${frenchSuggestions}</td>
242
+ </tr>
243
+ `;
244
+ });
245
+
246
+ html += `
247
+ </tbody>
248
+ </table>
249
+ </div>
250
+ <div class="mt-3">
251
+ <span class="badge badge-warning">${results.spelling_issues.length} spelling issue(s) found</span>
252
+ </div>
253
+ `;
254
+ } else {
255
+ html = '<div class="alert alert-success"><i class="fas fa-check me-2"></i>No spelling issues detected.</div>';
256
+ }
257
+
258
+ spellingContent.innerHTML = html;
259
+ }
260
+
261
+ function displayBarcodes(results) {
262
+ const barcodesContent = document.getElementById('barcodesContent');
263
+ let html = '';
264
+
265
+ if (results.barcodes_qr_codes && results.barcodes_qr_codes.length > 0) {
266
+ html += `
267
+ <div class="table-responsive">
268
+ <table class="table table-striped">
269
+ <thead>
270
+ <tr>
271
+ <th>Type</th>
272
+ <th>Data</th>
273
+ <th>Stack Type</th>
274
+ <th>Size</th>
275
+ <th>Method</th>
276
+ <th>Confidence</th>
277
+ <th>GS1 Valid</th>
278
+ <th>Position</th>
279
+ </tr>
280
+ </thead>
281
+ <tbody>
282
+ `;
283
+
284
+ results.barcodes_qr_codes.forEach(barcode => {
285
+ const position = `(${barcode.rect.left}, ${barcode.rect.top}) - (${barcode.rect.left + barcode.rect.width}, ${barcode.rect.top + barcode.rect.height})`;
286
+ const stackType = barcode.stack_type || 'Single Stack';
287
+ const method = barcode.method || 'Unknown';
288
+ const confidence = barcode.confidence || 0;
289
+ const gs1Valid = barcode.gs1_validated ? 'Yes' : 'No';
290
+ const sizeCategory = barcode.size_category || 'Normal';
291
+ const resolution = barcode.resolution || '';
292
+
293
+ // Format DataBar Expanded data if available
294
+ let dataDisplay = barcode.data;
295
+ if (barcode.expanded_data) {
296
+ dataDisplay = '<div><strong>Parsed Data:</strong><br>';
297
+ for (const [key, value] of Object.entries(barcode.expanded_data)) {
298
+ dataDisplay += `<span class="badge badge-info">${key}: ${value}</span> `;
299
+ }
300
+ dataDisplay += '</div>';
301
+ }
302
+
303
+ // Confidence color coding
304
+ let confidenceClass = 'badge-secondary';
305
+ if (confidence >= 80) confidenceClass = 'badge-success';
306
+ else if (confidence >= 60) confidenceClass = 'badge-warning';
307
+ else if (confidence >= 40) confidenceClass = 'badge-info';
308
+
309
+ // GS1 validation color
310
+ let gs1Class = barcode.gs1_validated ? 'badge-success' : 'badge-danger';
311
+
312
+ // Size category color
313
+ let sizeClass = 'badge-secondary';
314
+ if (sizeCategory === 'small') sizeClass = 'badge-warning';
315
+ else if (sizeCategory === 'tiny') sizeClass = 'badge-danger';
316
+
317
+ // Method display with resolution
318
+ let methodDisplay = method;
319
+ if (resolution) {
320
+ methodDisplay += `<br><small>${resolution}</small>`;
321
+ }
322
+
323
+ html += `
324
+ <tr>
325
+ <td><span class="badge badge-info">${barcode.type}</span></td>
326
+ <td>${dataDisplay}</td>
327
+ <td><span class="badge badge-secondary">${stackType}</span></td>
328
+ <td><span class="badge ${sizeClass}">${sizeCategory}</span></td>
329
+ <td><span class="badge badge-dark">${methodDisplay}</span></td>
330
+ <td><span class="badge ${confidenceClass}">${confidence}%</span></td>
331
+ <td><span class="badge ${gs1Class}">${gs1Valid}</span></td>
332
+ <td><small>${position}</small></td>
333
+ </tr>
334
+ `;
335
+ });
336
+
337
+ html += `
338
+ </tbody>
339
+ </table>
340
+ </div>
341
+ <div class="mt-3">
342
+ <span class="badge badge-info">${results.barcodes_qr_codes.length} barcode/QR code(s) detected</span>
343
+ <span class="badge badge-success">Enhanced DataBar detection active</span>
344
+ <span class="badge badge-warning">Small barcode detection active</span>
345
+ </div>
346
+ `;
347
+ } else {
348
+ html = '<div class="alert alert-info"><i class="fas fa-info-circle me-2"></i>No barcodes or QR codes detected.</div>';
349
+ }
350
+
351
+ barcodesContent.innerHTML = html;
352
+ }
353
+ });
ProofCheck/templates/index.html ADDED
@@ -0,0 +1,154 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>PDF Comparison Tool</title>
7
+ <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet">
8
+ <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet">
9
+ <link href="{{ url_for('static', filename='css/style.css') }}" rel="stylesheet">
10
+ </head>
11
+ <body>
12
+ <div class="container-fluid">
13
+ <div class="row">
14
+ <!-- Header -->
15
+ <div class="col-12">
16
+ <nav class="navbar navbar-expand-lg navbar-dark bg-primary">
17
+ <div class="container">
18
+ <a class="navbar-brand" href="#">
19
+ <i class="fas fa-file-pdf me-2"></i>
20
+ PDF Comparison Tool
21
+ </a>
22
+ </div>
23
+ </nav>
24
+ </div>
25
+ </div>
26
+
27
+ <div class="row mt-4">
28
+ <div class="col-12">
29
+ <div class="container">
30
+ <!-- Upload Section -->
31
+ <div class="card shadow-sm">
32
+ <div class="card-header bg-light">
33
+ <h5 class="mb-0">
34
+ <i class="fas fa-upload me-2"></i>
35
+ Upload PDF Files for Comparison
36
+ </h5>
37
+ </div>
38
+ <div class="card-body">
39
+ <form id="uploadForm" enctype="multipart/form-data">
40
+ <div class="row">
41
+ <div class="col-md-6">
42
+ <div class="mb-3">
43
+ <label for="pdf1" class="form-label">First PDF File</label>
44
+ <div class="drag-drop-zone" id="dragZone1">
45
+ <div class="drag-drop-content">
46
+ <i class="fas fa-cloud-upload-alt fa-3x text-muted mb-3"></i>
47
+ <p class="drag-drop-text">Drag & drop PDF here or click to browse</p>
48
+ <p class="drag-drop-hint">Select a PDF file for comparison</p>
49
+ </div>
50
+ <input type="file" class="form-control drag-drop-input" id="pdf1" name="pdf1" accept=".pdf" required>
51
+ </div>
52
+ </div>
53
+ </div>
54
+ <div class="col-md-6">
55
+ <div class="mb-3">
56
+ <label for="pdf2" class="form-label">Second PDF File</label>
57
+ <div class="drag-drop-zone" id="dragZone2">
58
+ <div class="drag-drop-content">
59
+ <i class="fas fa-cloud-upload-alt fa-3x text-muted mb-3"></i>
60
+ <p class="drag-drop-text">Drag & drop PDF here or click to browse</p>
61
+ <p class="drag-drop-hint">Select a PDF file for comparison</p>
62
+ </div>
63
+ <input type="file" class="form-control drag-drop-input" id="pdf2" name="pdf2" accept=".pdf" required>
64
+ </div>
65
+ </div>
66
+ </div>
67
+ </div>
68
+ <div class="d-grid">
69
+ <button type="submit" class="btn btn-primary btn-lg">
70
+ <i class="fas fa-search me-2"></i>
71
+ Compare PDFs
72
+ </button>
73
+ </div>
74
+ </form>
75
+ </div>
76
+ </div>
77
+
78
+ <!-- Loading Section -->
79
+ <div id="loadingSection" class="card shadow-sm mt-4" style="display: none;">
80
+ <div class="card-body text-center">
81
+ <div class="spinner-border text-primary" role="status">
82
+ <span class="visually-hidden">Loading...</span>
83
+ </div>
84
+ <p class="mt-3">Processing PDFs... This may take a few minutes.</p>
85
+ <div class="progress mt-3">
86
+ <div class="progress-bar progress-bar-striped progress-bar-animated" role="progressbar" style="width: 100%"></div>
87
+ </div>
88
+ </div>
89
+ </div>
90
+
91
+ <!-- Results Section -->
92
+ <div id="resultsSection" class="mt-4" style="display: none;">
93
+ <!-- Comparison Results Tabs -->
94
+ <div class="card shadow-sm">
95
+ <div class="card-header">
96
+ <ul class="nav nav-tabs card-header-tabs" id="resultsTabs" role="tablist">
97
+ <li class="nav-item" role="presentation">
98
+ <button class="nav-link active" id="visual-tab" data-bs-toggle="tab" data-bs-target="#visual" type="button" role="tab">
99
+ <i class="fas fa-eye me-2"></i>Visual Comparison
100
+ </button>
101
+ </li>
102
+ <li class="nav-item" role="presentation">
103
+ <button class="nav-link" id="spelling-tab" data-bs-toggle="tab" data-bs-target="#spelling" type="button" role="tab">
104
+ <i class="fas fa-spell-check me-2"></i>Spelling Issues
105
+ </button>
106
+ </li>
107
+ <li class="nav-item" role="presentation">
108
+ <button class="nav-link" id="barcodes-tab" data-bs-toggle="tab" data-bs-target="#barcodes" type="button" role="tab">
109
+ <i class="fas fa-barcode me-2"></i>Barcodes & QR Codes
110
+ </button>
111
+ </li>
112
+ </ul>
113
+ </div>
114
+ <div class="card-body">
115
+ <div class="tab-content" id="resultsTabContent">
116
+ <!-- Visual Comparison Tab -->
117
+ <div class="tab-pane fade show active" id="visual" role="tabpanel">
118
+ <div id="visualComparisonContent">
119
+ <!-- Content will be populated by JavaScript -->
120
+ </div>
121
+ </div>
122
+
123
+ <!-- Spelling Issues Tab -->
124
+ <div class="tab-pane fade" id="spelling" role="tabpanel">
125
+ <div id="spellingIssuesContent">
126
+ <!-- Content will be populated by JavaScript -->
127
+ </div>
128
+ </div>
129
+
130
+ <!-- Barcodes Tab -->
131
+ <div class="tab-pane fade" id="barcodes" role="tabpanel">
132
+ <div id="barcodesContent">
133
+ <!-- Content will be populated by JavaScript -->
134
+ </div>
135
+ </div>
136
+ </div>
137
+ </div>
138
+ </div>
139
+ </div>
140
+
141
+ <!-- Error Section -->
142
+ <div id="errorSection" class="alert alert-danger mt-4" style="display: none;">
143
+ <i class="fas fa-exclamation-triangle me-2"></i>
144
+ <span id="errorMessage"></span>
145
+ </div>
146
+ </div>
147
+ </div>
148
+ </div>
149
+ </div>
150
+
151
+ <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/js/bootstrap.bundle.min.js"></script>
152
+ <script src="{{ url_for('static', filename='js/script.js') }}"></script>
153
+ </body>
154
+ </html>
ProofCheck/test_setup.py ADDED
@@ -0,0 +1,133 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script to verify PDF Comparison Tool setup
4
+ """
5
+
6
+ import sys
7
+ import importlib
8
+
9
+ def test_imports():
10
+ """Test if all required packages can be imported"""
11
+ required_packages = [
12
+ 'flask',
13
+ 'cv2',
14
+ 'numpy',
15
+ 'PIL',
16
+ 'pytesseract',
17
+ 'pdf2image',
18
+ 'pyzbar',
19
+ 'spellchecker',
20
+ 'nltk',
21
+ 'skimage',
22
+ 'matplotlib',
23
+ 'pandas'
24
+ ]
25
+
26
+ print("Testing package imports...")
27
+ failed_imports = []
28
+
29
+ for package in required_packages:
30
+ try:
31
+ importlib.import_module(package)
32
+ print(f"✓ {package}")
33
+ except ImportError as e:
34
+ print(f"✗ {package}: {e}")
35
+ failed_imports.append(package)
36
+
37
+ return failed_imports
38
+
39
+ def test_tesseract():
40
+ """Test if Tesseract OCR is available"""
41
+ print("\nTesting Tesseract OCR...")
42
+ try:
43
+ import pytesseract
44
+ # Try to get Tesseract version
45
+ version = pytesseract.get_tesseract_version()
46
+ print(f"✓ Tesseract version: {version}")
47
+ return True
48
+ except Exception as e:
49
+ print(f"✗ Tesseract not found: {e}")
50
+ print("Please install Tesseract OCR:")
51
+ print(" macOS: brew install tesseract")
52
+ print(" Ubuntu: sudo apt-get install tesseract-ocr")
53
+ print(" Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki")
54
+ return False
55
+
56
+ def test_pdf_comparator():
57
+ """Test if PDFComparator class can be instantiated"""
58
+ print("\nTesting PDFComparator...")
59
+ try:
60
+ from pdf_comparator import PDFComparator
61
+ comparator = PDFComparator()
62
+ print("✓ PDFComparator initialized successfully")
63
+ return True
64
+ except Exception as e:
65
+ print(f"✗ PDFComparator error: {e}")
66
+ return False
67
+
68
+ def test_flask_app():
69
+ """Test if Flask app can be imported"""
70
+ print("\nTesting Flask application...")
71
+ try:
72
+ from app import app
73
+ print("✓ Flask app imported successfully")
74
+ return True
75
+ except Exception as e:
76
+ print(f"✗ Flask app error: {e}")
77
+ return False
78
+
79
+ def main():
80
+ """Run all tests"""
81
+ print("PDF Comparison Tool - Setup Test")
82
+ print("=" * 40)
83
+
84
+ # Test imports
85
+ failed_imports = test_imports()
86
+
87
+ # Test Tesseract
88
+ tesseract_ok = test_tesseract()
89
+
90
+ # Test PDFComparator
91
+ comparator_ok = test_pdf_comparator()
92
+
93
+ # Test Flask app
94
+ flask_ok = test_flask_app()
95
+
96
+ # Summary
97
+ print("\n" + "=" * 40)
98
+ print("SETUP SUMMARY")
99
+ print("=" * 40)
100
+
101
+ if failed_imports:
102
+ print(f"✗ Missing packages: {', '.join(failed_imports)}")
103
+ print("Run: pip install -r requirements.txt")
104
+ else:
105
+ print("✓ All packages imported successfully")
106
+
107
+ if tesseract_ok:
108
+ print("✓ Tesseract OCR is available")
109
+ else:
110
+ print("✗ Tesseract OCR is not available")
111
+
112
+ if comparator_ok:
113
+ print("✓ PDFComparator is working")
114
+ else:
115
+ print("✗ PDFComparator has issues")
116
+
117
+ if flask_ok:
118
+ print("✓ Flask application is ready")
119
+ else:
120
+ print("✗ Flask application has issues")
121
+
122
+ # Overall status
123
+ all_ok = not failed_imports and tesseract_ok and comparator_ok and flask_ok
124
+
125
+ if all_ok:
126
+ print("\n🎉 Setup is complete! You can run the application with:")
127
+ print(" python app.py")
128
+ else:
129
+ print("\n⚠️ Setup is incomplete. Please fix the issues above.")
130
+ sys.exit(1)
131
+
132
+ if __name__ == "__main__":
133
+ main()
README.md ADDED
@@ -0,0 +1,203 @@
1
+ # PDF Comparison Tool
2
+
3
+ A comprehensive web-based tool for comparing PDF documents with advanced features including OCR validation, color difference detection, spelling verification, and barcode/QR code detection.
4
+
5
+ ## Features
6
+
7
+ - **PDF Validation**: Ensures uploaded PDFs contain "50 Carroll" using OCR
8
+ - **Color Difference Detection**: Identifies visual differences between PDFs and highlights them with red boxes
9
+ - **Spelling Verification**: Checks text against both English and French dictionaries
10
+ - **Barcode/QR Code Detection**: Automatically detects and reads barcodes and QR codes
11
+ - **Visual Comparison**: Side-by-side comparison with annotated differences
12
+ - **Modern Web Interface**: Responsive design with Bootstrap and custom styling
13
+
14
+ ## Requirements
15
+
16
+ ### System Requirements
17
+ - Python 3.7 or higher
18
+ - macOS, Linux, or Windows
19
+ - Tesseract OCR engine (for text extraction)
20
+
21
+ ### Python Dependencies
22
+ All dependencies are listed in `requirements.txt`:
23
+ - Flask (web framework)
24
+ - PyPDF2 (PDF processing)
25
+ - pdf2image (PDF to image conversion)
26
+ - OpenCV (image processing)
27
+ - pytesseract (OCR)
28
+ - pyzbar (barcode detection)
29
+ - pyspellchecker (spelling verification)
30
+ - scikit-image (image comparison)
31
+ - Pillow (image manipulation)
32
+
33
+ ## Installation
34
+
35
+ ### 1. Install Tesseract OCR
36
+
37
+ **macOS:**
38
+ ```bash
39
+ brew install tesseract
40
+ ```
41
+
42
+ **Ubuntu/Debian:**
43
+ ```bash
44
+ sudo apt-get install tesseract-ocr
45
+ ```
46
+
47
+ **Windows:**
48
+ Download from [Tesseract GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
49
+
50
+ ### 2. Install Python Dependencies
51
+
52
+ ```bash
53
+ # Create virtual environment (recommended)
54
+ python -m venv venv
55
+ source venv/bin/activate # On Windows: venv\Scripts\activate
56
+
57
+ # Install dependencies
58
+ pip install -r requirements.txt
59
+ ```
60
+
61
+ ### 3. Download Language Data (if needed)
62
+
63
+ The application will automatically download required NLTK data on first run.
64
+
65
+ ## Usage
66
+
67
+ ### 1. Start the Application
68
+
69
+ ```bash
70
+ python app.py
71
+ ```
72
+
73
+ The application will start on `http://localhost:5000`
74
+
75
+ ### 2. Upload PDFs
76
+
77
+ 1. Open your web browser and navigate to `http://localhost:5000`
78
+ 2. Select two PDF files for comparison
79
+ 3. Both PDFs must contain "50 Carroll" for validation
80
+ 4. Click "Compare PDFs" to start the analysis
81
+
82
+ ### 3. View Results
83
+
84
+ The comparison results are displayed in three tabs:
85
+
86
+ - **Visual Comparison**: Side-by-side view with red boxes highlighting differences
87
+ - **Spelling Issues**: Table of spelling errors with suggestions from English and French dictionaries
88
+ - **Barcodes & QR Codes**: List of detected barcodes with their data and positions
89
+
90
+ ## File Structure
91
+
92
+ ```
93
+ ProofCheck/
94
+ ├── app.py # Main Flask application
95
+ ├── pdf_comparator.py # PDF comparison logic
96
+ ├── requirements.txt # Python dependencies
97
+ ├── README.md # This file
98
+ ├── templates/
99
+ │ └── index.html # Main web interface
100
+ ├── static/
101
+ │ ├── css/
102
+ │ │ └── style.css # Custom styles
103
+ │ ├── js/
104
+ │ │ └── script.js # Frontend JavaScript
105
+ │ └── results/ # Generated comparison images
106
+ ├── uploads/ # Temporary uploaded files
107
+ └── results/ # Comparison results JSON files
108
+ ```
109
+
110
+ ## How It Works
111
+
112
+ ### 1. PDF Validation
113
+ - Converts PDF pages to images using `pdf2image`
114
+ - Uses Tesseract OCR to extract text
115
+ - Validates presence of "50 Carroll" in extracted text
116
+
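
A minimal sketch of the validation step described above, assuming Tesseract and Poppler are installed; the upload path is illustrative, while "50 Carroll" is the phrase the tool actually checks for:

```python
import pytesseract
from pdf2image import convert_from_path

def pdf_contains_phrase(pdf_path, phrase="50 Carroll", dpi=300):
    """Render each page to an image and look for the phrase in the OCR text."""
    for page in convert_from_path(pdf_path, dpi=dpi):
        text = pytesseract.image_to_string(page, config="--oem 3 --psm 6")
        if phrase.lower() in text.lower():
            return True
    return False

# Hypothetical usage:
# print(pdf_contains_phrase("uploads/example.pdf"))
```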
117
+ ### 2. Color Difference Detection
118
+ - Converts PDF pages to images
119
+ - Resizes images to same dimensions
120
+ - Uses structural similarity index (SSIM) to detect differences
121
+ - Draws red rectangles around detected differences
122
+
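
A condensed sketch of the SSIM pass, assuming two PIL page renders of the same page; the Otsu threshold and the 100-pixel area filter mirror the values used in `pdf_comparator.py`:

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def diff_boxes(page1, page2, min_area=100):
    """Return bounding boxes (x, y, w, h) where two page renders differ."""
    img1, img2 = np.array(page1), np.array(page2)
    h = min(img1.shape[0], img2.shape[0])
    w = min(img1.shape[1], img2.shape[1])
    img1, img2 = cv2.resize(img1, (w, h)), cv2.resize(img2, (w, h))
    gray1 = cv2.cvtColor(img1, cv2.COLOR_RGB2GRAY)
    gray2 = cv2.cvtColor(img2, cv2.COLOR_RGB2GRAY)
    # full=True returns the per-pixel similarity map alongside the score
    _, diff = ssim(gray1, gray2, full=True)
    diff = (diff * 255).astype("uint8")
    mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > min_area]
```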
123
+ ### 3. Spelling Verification
124
+ - Extracts text using OCR
125
+ - Splits text into individual words
126
+ - Checks each word against English and French dictionaries
127
+ - Provides spelling suggestions for incorrect words
128
+
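
A simplified sketch of the bilingual check with `pyspellchecker`; the real implementation also applies a domain whitelist and a per-token language heuristic, which are omitted here:

```python
import re
from spellchecker import SpellChecker

english = SpellChecker(language="en")
french = SpellChecker(language="fr")

def spelling_issues(text):
    """Flag words that neither the English nor the French dictionary recognises."""
    words = {w.lower() for w in re.findall(r"[A-Za-z][A-Za-z'-]{2,}", text)}
    issues = []
    for word in sorted(words):
        if word in english.unknown([word]) and word in french.unknown([word]):
            issues.append({
                "word": word,
                "suggestions_en": list(english.candidates(word) or [])[:3],
                "suggestions_fr": list(french.candidates(word) or [])[:3],
            })
    return issues
```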
129
+ ### 4. Barcode/QR Code Detection
130
+ - Uses `pyzbar` library to detect barcodes and QR codes
131
+ - Extracts data and position information
132
+ - Displays results in organized table format
133
+
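
A small sketch of this step, assuming the page has already been rendered to a PIL image with `pdf2image`:

```python
import numpy as np
from pyzbar.pyzbar import decode

def read_codes(page_image):
    """Decode every barcode and QR code found on a rendered page."""
    codes = []
    for obj in decode(np.array(page_image)):
        codes.append({
            "type": obj.type,  # e.g. "QRCODE", "EAN13", "CODE128"
            "data": obj.data.decode("utf-8", errors="replace"),
            "rect": {"left": obj.rect.left, "top": obj.rect.top,
                     "width": obj.rect.width, "height": obj.rect.height},
        })
    return codes
```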
134
+ ## Configuration
135
+
136
+ ### Environment Variables
137
+ - `FLASK_ENV`: Set to `development` for debug mode
138
+ - `MAX_CONTENT_LENGTH`: Maximum file upload size (default: 16MB)
139
+
140
+ ### Customization
141
+ - Modify `pdf_comparator.py` to change comparison algorithms
142
+ - Update `static/css/style.css` for custom styling
143
+ - Edit `templates/index.html` for interface changes
144
+
145
+ ## Troubleshooting
146
+
147
+ ### Common Issues
148
+
149
+ 1. **Tesseract not found**
150
+ - Ensure Tesseract is installed and in your system PATH
151
+ - On macOS, try: `brew install tesseract`
152
+
153
+ 2. **PDF processing errors**
154
+ - Check that PDFs are not corrupted
155
+ - Ensure PDFs contain readable text (not just images)
156
+
157
+ 3. **Memory issues with large PDFs**
158
+ - Reduce DPI in `pdf_comparator.py` (default: 200)
159
+ - Process PDFs page by page for very large documents
160
+
161
+ 4. **Spelling checker not working**
162
+ - Ensure internet connection for first run (downloads dictionary data)
163
+ - Check that `pyspellchecker` is properly installed
164
+
165
+ ### Performance Tips
166
+
167
+ - Use smaller DPI values for faster processing
168
+ - Limit PDF page count for large documents
169
+ - Ensure sufficient RAM for image processing
170
+
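
For example, rendering at a lower DPI (the file name and value below are illustrative; choose a setting that balances OCR accuracy against memory use):

```python
from pdf2image import convert_from_path

# 150 DPI roughly quarters the pixel count of a 300 DPI render,
# cutting memory use and OCR time at the cost of accuracy on small text.
images = convert_from_path("large_document.pdf", dpi=150)
```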
171
+ ## Security Considerations
172
+
173
+ - Uploaded files are stored temporarily and cleaned up
174
+ - File size limits prevent DoS attacks
175
+ - Input validation prevents malicious file uploads
176
+ - Session-based file handling ensures isolation
177
+
178
+ ## Contributing
179
+
180
+ 1. Fork the repository
181
+ 2. Create a feature branch
182
+ 3. Make your changes
183
+ 4. Add tests if applicable
184
+ 5. Submit a pull request
185
+
186
+ ## License
187
+
188
+ This project is open source and available under the MIT License.
189
+
190
+ ## Support
191
+
192
+ For issues and questions:
193
+ 1. Check the troubleshooting section
194
+ 2. Review the code comments
195
+ 3. Create an issue on the repository
196
+
197
+ ## Future Enhancements
198
+
199
+ - Support for more document formats
200
+ - Advanced text comparison algorithms
201
+ - Machine learning-based difference detection
202
+ - Batch processing capabilities
203
+ - Export functionality for comparison reports
app.py ADDED
@@ -0,0 +1,97 @@
1
+ import os
2
+ import uuid
3
+ import json
4
+ from flask import Flask, request, render_template, jsonify, send_file
5
+ from werkzeug.utils import secure_filename
6
+ from pdf_comparator import PDFComparator
7
+ import tempfile
8
+ import shutil
9
+
10
+ app = Flask(__name__)
11
+ app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16MB max file size
12
+ app.config['UPLOAD_FOLDER'] = 'uploads'
13
+ app.config['RESULTS_FOLDER'] = 'results'
14
+
15
+ # Ensure directories exist
16
+ os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
17
+ os.makedirs(app.config['RESULTS_FOLDER'], exist_ok=True)
18
+
19
+ ALLOWED_EXTENSIONS = {'pdf'}
20
+
21
+ def allowed_file(filename):
22
+ return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
23
+
24
+ @app.route('/')
25
+ def index():
26
+ return render_template('index.html')
27
+
28
+ @app.route('/upload', methods=['POST'])
29
+ def upload_files():
30
+ if 'pdf1' not in request.files or 'pdf2' not in request.files:
31
+ return jsonify({'error': 'Both PDF files are required'}), 400
32
+
33
+ pdf1 = request.files['pdf1']
34
+ pdf2 = request.files['pdf2']
35
+
36
+ if pdf1.filename == '' or pdf2.filename == '':
37
+ return jsonify({'error': 'Both PDF files are required'}), 400
38
+
39
+ if not (allowed_file(pdf1.filename) and allowed_file(pdf2.filename)):
40
+ return jsonify({'error': 'Only PDF files are allowed'}), 400
41
+
42
+ # Create unique session directory
43
+ session_id = str(uuid.uuid4())
44
+ session_dir = os.path.join(app.config['UPLOAD_FOLDER'], session_id)
45
+ os.makedirs(session_dir, exist_ok=True)
46
+
47
+ # Save uploaded files
48
+ pdf1_path = os.path.join(session_dir, secure_filename(pdf1.filename))
49
+ pdf2_path = os.path.join(session_dir, secure_filename(pdf2.filename))
50
+
51
+ pdf1.save(pdf1_path)
52
+ pdf2.save(pdf2_path)
53
+
54
+ try:
55
+ # Initialize PDF comparator
56
+ comparator = PDFComparator()
57
+
58
+ # Perform comparison
59
+ results = comparator.compare_pdfs(pdf1_path, pdf2_path, session_id)
60
+
61
+ # Save results
62
+ results_path = os.path.join(app.config['RESULTS_FOLDER'], f'{session_id}_results.json')
63
+ with open(results_path, 'w') as f:
64
+ json.dump(results, f, indent=2)
65
+
66
+ return jsonify({
67
+ 'success': True,
68
+ 'session_id': session_id,
69
+ 'results': results
70
+ })
71
+
72
+ except Exception as e:
73
+ return jsonify({'error': str(e)}), 500
74
+
75
+ @app.route('/results/<session_id>')
76
+ def get_results(session_id):
77
+ results_path = os.path.join(app.config['RESULTS_FOLDER'], f'{session_id}_results.json')
78
+
79
+ if not os.path.exists(results_path):
80
+ return jsonify({'error': 'Results not found'}), 404
81
+
82
+ with open(results_path, 'r') as f:
83
+ results = json.load(f)
84
+
85
+ return jsonify(results)
86
+
87
+ @app.route('/download/<session_id>/<filename>')
88
+ def download_file(session_id, filename):
89
+ file_path = os.path.join(app.config['UPLOAD_FOLDER'], session_id, filename)
90
+
91
+ if not os.path.exists(file_path):
92
+ return jsonify({'error': 'File not found'}), 404
93
+
94
+ return send_file(file_path, as_attachment=True)
95
+
96
+ if __name__ == '__main__':
97
+ app.run(debug=True, host='0.0.0.0', port=5000)
pdf_comparator.py ADDED
@@ -0,0 +1,551 @@
1
+ import os
2
+ import cv2
3
+ import numpy as np
4
+ from PIL import Image, ImageDraw, ImageFont
5
+ import pytesseract
6
+ from pdf2image import convert_from_path
7
+ from pyzbar.pyzbar import decode
8
+ from spellchecker import SpellChecker
9
+ import nltk
10
+ from skimage.metrics import structural_similarity as ssim
11
+ from skimage import color
12
+ import json
13
+ import tempfile
14
+ import shutil
15
+ import unicodedata
16
+ import regex as re
17
+
18
+ # Domain whitelist for spell checking
19
+ DOMAIN_WHITELIST = {
20
+ # units / abbreviations
21
+ "mg", "mg/g", "ml", "g", "thc", "cbd", "tcm", "mct",
22
+ # common packaging terms / bilingual words you expect
23
+ "gouttes", "tennir", "net", "zoom", "tytann", "dome", "drops",
24
+ # brand or proper names you want to ignore completely
25
+ "purified", "brands", "tytann", "dome", "drops",
26
+ }
27
+ # lowercase everything in whitelist for comparisons
28
+ DOMAIN_WHITELIST = {w.lower() for w in DOMAIN_WHITELIST}
29
+
30
+ # Safe import for regex with fallback
31
+ try:
32
+ import regex as _re
33
+ _USE_REGEX = True
34
+ except ImportError:
35
+ import re as _re
36
+ _USE_REGEX = False
37
+
38
+ TOKEN_PATTERN = r"(?:\p{L})(?:[\p{L}'-]{1,})" if _USE_REGEX else r"[A-Za-z][A-Za-z'-]{1,}"
39
+
40
+ class PDFComparator:
41
+ def __init__(self):
42
+ # Initialize spell checkers for English and French
43
+ self.english_spellchecker = SpellChecker(language='en')
44
+ self.french_spellchecker = SpellChecker(language='fr')
45
+
46
+ # Add domain whitelist to spell checkers
47
+ for w in DOMAIN_WHITELIST:
48
+ self.english_spellchecker.word_frequency.add(w)
49
+ self.french_spellchecker.word_frequency.add(w)
50
+
51
+ # Download required NLTK data
52
+ try:
53
+ nltk.data.find('tokenizers/punkt')
54
+ except LookupError:
55
+ nltk.download('punkt')
56
+
57
+ def enhance_image_for_tiny_fonts(self, image):
58
+ """Enhance image specifically for tiny font OCR"""
59
+ try:
60
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
61
+ clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8,8))
62
+ enhanced = clahe.apply(gray)
63
+ denoised = cv2.bilateralFilter(enhanced, 9, 75, 75)
64
+ gaussian = cv2.GaussianBlur(denoised, (0, 0), 2.0)
65
+ unsharp_mask = cv2.addWeighted(denoised, 1.5, gaussian, -0.5, 0)
66
+ thresh = cv2.adaptiveThreshold(unsharp_mask, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
67
+ kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))
68
+ cleaned = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
69
+ return cleaned
70
+ except Exception as e:
71
+ print(f"Error enhancing image for tiny fonts: {str(e)}")
72
+ return image
73
+
74
+ def create_inverted_image(self, image):
75
+ """Create inverted image for white text on dark backgrounds"""
76
+ try:
77
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
78
+ inverted = cv2.bitwise_not(gray)
79
+ clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
80
+ enhanced = clahe.apply(inverted)
81
+ _, thresh = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
82
+ return thresh
83
+ except Exception as e:
84
+ print(f"Error creating inverted image: {str(e)}")
85
+ return image
86
+
87
+ def extract_color_channels(self, image):
88
+ """Extract text from different color channels"""
89
+ try:
90
+ # RGB channels
91
+ b, g, r = cv2.split(image)
92
+
93
+ # HSV channels
94
+ hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
95
+ h, s, v = cv2.split(hsv)
96
+
97
+ # LAB channels
98
+ lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
99
+ l, a, b_lab = cv2.split(lab)
100
+
101
+ channels = [r, g, b, v, l]
102
+ texts = []
103
+
104
+ for channel in channels:
105
+ _, thresh = cv2.threshold(channel, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
106
+ text = pytesseract.image_to_string(thresh, config='--oem 3 --psm 6')
107
+ if text.strip():
108
+ texts.append(text)
109
+
110
+ return texts
111
+ except Exception as e:
112
+ print(f"Error extracting color channels: {str(e)}")
113
+ return []
114
+
115
+ def create_edge_enhanced_image(self, image):
116
+ """Create edge-enhanced image for text detection"""
117
+ try:
118
+ gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
119
+ edges = cv2.Canny(gray, 50, 150)
120
+ kernel = np.ones((2,2), np.uint8)
121
+ dilated = cv2.dilate(edges, kernel, iterations=1)
122
+ inverted = cv2.bitwise_not(dilated)
123
+ return inverted
124
+ except Exception as e:
125
+ print(f"Error creating edge-enhanced image: {str(e)}")
126
+ return image
127
+
128
+ def ocr_with_multiple_configs(self, image):
129
+ """Run OCR with multiple configurations and return best result"""
130
+ configs = [
131
+ '--oem 3 --psm 6', # Uniform block of text
132
+ '--oem 3 --psm 8', # Single word
133
+ '--oem 3 --psm 13', # Raw line
134
+ '--oem 1 --psm 6', # LSTM + Uniform block
135
+ '--oem 3 --psm 3', # Fully automatic page segmentation
136
+ ]
137
+
138
+ best_text = ""
139
+ best_length = 0
140
+
141
+ for config in configs:
142
+ try:
143
+ text = pytesseract.image_to_string(image, config=config)
144
+ if len(text.strip()) > best_length:
145
+ best_text = text
146
+ best_length = len(text.strip())
147
+ except Exception as e:
148
+ print(f"OCR config {config} failed: {str(e)}")
149
+ continue
150
+
151
+ return best_text
152
+
153
+ def extract_multi_color_text(self, image):
154
+ """Extract text using multiple preprocessing methods"""
155
+ texts = []
156
+
157
+ # Method 1: Standard black text
158
+ enhanced = self.enhance_image_for_tiny_fonts(image)
159
+ text1 = self.ocr_with_multiple_configs(enhanced)
160
+ if text1.strip():
161
+ texts.append(text1)
162
+
163
+ # Method 2: Inverted text (white on dark)
164
+ inverted = self.create_inverted_image(image)
165
+ text2 = self.ocr_with_multiple_configs(inverted)
166
+ if text2.strip():
167
+ texts.append(text2)
168
+
169
+ # Method 3: Color channel separation
170
+ color_texts = self.extract_color_channels(image)
171
+ texts.extend(color_texts)
172
+
173
+ # Method 4: Edge-enhanced
174
+ edge_enhanced = self.create_edge_enhanced_image(image)
175
+ text4 = self.ocr_with_multiple_configs(edge_enhanced)
176
+ if text4.strip():
177
+ texts.append(text4)
178
+
179
+ # Combine all texts and return the best one
180
+ combined_text = " ".join(texts)
181
+ return combined_text
182
+
183
+ def validate_pdf(self, pdf_path):
184
+ """Validate that PDF contains '50 Carroll' using enhanced OCR"""
185
+ try:
186
+ # Multiple DPI settings for better detection
187
+ dpi_settings = [200, 300, 400]
188
+
189
+ for dpi in dpi_settings:
190
+ try:
191
+ images = convert_from_path(pdf_path, dpi=dpi)
192
+
193
+ for page_num, image in enumerate(images):
194
+ # Convert PIL image to OpenCV format
195
+ opencv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
196
+
197
+ # Enhanced text extraction
198
+ text = self.extract_multi_color_text(opencv_image)
199
+
200
+ # Check for "50 Carroll" with multiple patterns
201
+ patterns = ["50 Carroll", "50 carroll", "50Carroll", "50 carroll"]
202
+ for pattern in patterns:
203
+ if pattern in text:
204
+ return True
205
+
206
+ # Also try standard OCR as fallback
207
+ standard_text = pytesseract.image_to_string(opencv_image, config='--oem 3 --psm 6')
208
+ for pattern in patterns:
209
+ if pattern in standard_text:
210
+ return True
211
+
212
+ except Exception as e:
213
+ print(f"DPI {dpi} failed: {str(e)}")
214
+ continue
215
+
216
+ return False
217
+
218
+ except Exception as e:
219
+ raise Exception(f"Error validating PDF: {str(e)}")
220
+
221
+ def extract_text_from_pdf(self, pdf_path):
222
+ """Extract text from PDF using enhanced OCR"""
223
+ try:
224
+ # Use higher DPI for better text extraction
225
+ images = convert_from_path(pdf_path, dpi=300)
226
+ all_text = []
227
+
228
+ for page_num, image in enumerate(images):
229
+ opencv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
230
+
231
+ # Enhanced text extraction
232
+ text = self.extract_multi_color_text(opencv_image)
233
+
234
+ # Fallback to standard OCR if enhanced extraction is empty
235
+ if not text.strip():
236
+ text = pytesseract.image_to_string(opencv_image, config='--oem 3 --psm 6')
237
+
238
+ all_text.append({
239
+ 'page': page_num + 1,
240
+ 'text': text,
241
+ 'image': image
242
+ })
243
+
244
+ return all_text
245
+
246
+ except Exception as e:
247
+ raise Exception(f"Error extracting text from PDF: {str(e)}")
248
+
249
+ def _likely_french(self, token: str) -> bool:
250
+ """Helper function to guess if a token is likely French"""
251
+ if _USE_REGEX:
252
+ # any Latin letter outside ASCII => probably FR (é, è, ç, …)
253
+ return bool(_re.search(r"(?V1)[\p{Letter}&&\p{Latin}&&[^A-Za-z]]", token))
254
+ # fallback: any non-ascii letter
255
+ return any((not ('a' <= c.lower() <= 'z')) and c.isalpha() for c in token)
256
+
257
+ def check_spelling(self, text):
258
+ """
259
+ Robust EN/FR spell check:
260
+ - Unicode-aware tokens (keeps accents)
261
+ - Normalizes curly quotes/ligatures
262
+ - Heuristic per-token language (accented => FR; else EN)
263
+ - Flags if unknown in its likely language (not both)
264
+ """
265
+ try:
266
+ text = unicodedata.normalize("NFKC", text)
267
+ text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
268
+
269
+ tokens = _re.findall(TOKEN_PATTERN, text, flags=_re.UNICODE if _USE_REGEX else 0)
270
+
271
+ issues = []
272
+ for raw in tokens:
273
+ t = raw.lower()
274
+
275
+ # skip very short, short ALL-CAPS acronyms, and whitelisted terms
276
+ if len(t) < 3:
277
+ continue
278
+ if raw.isupper() and len(raw) <= 3: # Changed from <=5 to <=3
279
+ continue
280
+ if t in DOMAIN_WHITELIST:
281
+ continue
282
+
283
+ miss_en = t in self.english_spellchecker.unknown([t])
284
+ miss_fr = t in self.french_spellchecker.unknown([t])
285
+
286
+ use_fr = self._likely_french(raw)
287
+
288
+ # Prefer the likely language, but fall back to "either language unknown"
289
+ if (use_fr and miss_fr) or ((not use_fr) and miss_en) or (miss_en and miss_fr):
290
+ issues.append({
291
+ "word": raw,
292
+ "lang": "fr" if use_fr else "en",
293
+ "suggestions_en": list(self.english_spellchecker.candidates(t))[:3],
294
+ "suggestions_fr": list(self.french_spellchecker.candidates(t))[:3],
295
+ })
296
+
297
+ return issues
298
+ except Exception as e:
299
+ print(f"Error checking spelling: {e}")
300
+ return []
301
+
302
+ def annotate_spelling_errors_on_image(self, pil_image, misspelled):
303
+ """
304
+ Draw one red rectangle around each misspelled token using Tesseract word boxes.
305
+ 'misspelled' must be a list of dicts with 'word' keys (from check_spelling).
306
+ """
307
+ if not misspelled:
308
+ return pil_image
309
+
310
+ def _norm(s: str) -> str:
311
+ return unicodedata.normalize("NFKC", s).replace("\u2019", "'").strip(".,:;!?)(").lower()
312
+
313
+ miss_set = {_norm(m["word"]) for m in misspelled}
314
+
315
+ img = pil_image
316
+ try:
317
+ data = pytesseract.image_to_data(
318
+ img,
319
+ lang="eng+fra", # Added lang parameter
320
+ config="--oem 3 --psm 6",
321
+ output_type=pytesseract.Output.DICT,
322
+ )
323
+ except Exception as e:
324
+ print("image_to_data failed:", e)
325
+ return img
326
+
327
+ draw = ImageDraw.Draw(img)
328
+ n = len(data.get("text", []))
329
+ for i in range(n):
330
+ word = (data["text"][i] or "").strip()
331
+ if not word:
332
+ continue
333
+ clean = _norm(word) # Used _norm function
334
+
335
+ if clean and clean in miss_set:
336
+ x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
337
+ draw.rectangle([x, y, x + w, y + h], outline="red", width=4)
338
+
339
+ return img
340
+
341
+ def detect_barcodes_qr_codes(self, image):
342
+ """Detect and decode barcodes and QR codes"""
343
+ try:
344
+ # Convert PIL image to OpenCV format
345
+ opencv_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
346
+
347
+ # Decode barcodes and QR codes
348
+ decoded_objects = decode(opencv_image)
349
+
350
+ barcodes = []
351
+ for obj in decoded_objects:
352
+ barcode_info = {
353
+ 'type': obj.type,
354
+ 'data': obj.data.decode('utf-8'),
355
+ 'rect': {'left': obj.rect.left, 'top': obj.rect.top, 'width': obj.rect.width, 'height': obj.rect.height}
356
+ }
357
+ barcodes.append(barcode_info)
358
+
359
+ return barcodes
360
+
361
+ except Exception as e:
362
+ print(f"Error detecting barcodes: {str(e)}")
363
+ return []
364
+
365
+ def compare_colors(self, image1, image2):
366
+ """Compare colors between two images and return differences"""
367
+ try:
368
+ # Convert images to same size
369
+ img1 = np.array(image1)
370
+ img2 = np.array(image2)
371
+
372
+ # Resize images to same dimensions
373
+ height = min(img1.shape[0], img2.shape[0])
374
+ width = min(img1.shape[1], img2.shape[1])
375
+
376
+ img1_resized = cv2.resize(img1, (width, height))
377
+ img2_resized = cv2.resize(img2, (width, height))
378
+
379
+ # Convert to grayscale for comparison
380
+ gray1 = cv2.cvtColor(img1_resized, cv2.COLOR_RGB2GRAY)
381
+ gray2 = cv2.cvtColor(img2_resized, cv2.COLOR_RGB2GRAY)
382
+
383
+ # Calculate structural similarity
384
+ (score, diff) = ssim(gray1, gray2, full=True)
385
+
386
+ # Convert difference to binary mask
387
+ diff = (diff * 255).astype("uint8")
388
+ thresh = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
389
+
390
+ # Find contours of differences
391
+ contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
392
+
393
+ color_differences = []
394
+ for contour in contours:
395
+ if cv2.contourArea(contour) > 100: # Filter small differences
396
+ x, y, w, h = cv2.boundingRect(contour)
397
+ color_differences.append({
398
+ 'x': x,
399
+ 'y': y,
400
+ 'width': w,
401
+ 'height': h,
402
+ 'area': cv2.contourArea(contour)
403
+ })
404
+
405
+ return color_differences
406
+
407
+ except Exception as e:
408
+ print(f"Error comparing colors: {str(e)}")
409
+ return []
410
+
411
+ def create_annotated_image(self, image, differences, output_path):
412
+ """Create annotated image with red boxes around differences"""
413
+ try:
414
+ # Create a copy of the image
415
+ annotated_image = image.copy()
416
+ draw = ImageDraw.Draw(annotated_image)
417
+
418
+ # Draw red rectangles around differences
419
+ for diff in differences:
420
+ x, y, w, h = diff['x'], diff['y'], diff['width'], diff['height']
421
+ draw.rectangle([x, y, x + w, y + h], outline='red', width=3)
422
+
423
+ # Save annotated image
424
+ annotated_image.save(output_path)
425
+
426
+ except Exception as e:
427
+ print(f"Error creating annotated image: {str(e)}")
428
+
429
+ def compare_pdfs(self, pdf1_path, pdf2_path, session_id):
430
+ """Main comparison function"""
431
+ try:
432
+ # Validate both PDFs contain "50 Carroll"
433
+ if not self.validate_pdf(pdf1_path):
434
+ raise Exception("INVALID DOCUMENT")
435
+
436
+ if not self.validate_pdf(pdf2_path):
437
+ raise Exception("INVALID DOCUMENT")
438
+
439
+ # Extract text and images from both PDFs
440
+ pdf1_data = self.extract_text_from_pdf(pdf1_path)
441
+ pdf2_data = self.extract_text_from_pdf(pdf2_path)
442
+
443
+ # Initialize results
444
+ results = {
445
+ 'session_id': session_id,
446
+ 'validation': {
447
+ 'pdf1_valid': True,
448
+ 'pdf2_valid': True,
449
+ 'validation_text': '50 Carroll'
450
+ },
451
+ 'text_comparison': [],
452
+ 'spelling_issues': [],
453
+ 'barcodes_qr_codes': [],
454
+ 'color_differences': [],
455
+ 'annotated_images': []
456
+ }
457
+
458
+ # Compare text and check spelling
459
+ for i, (page1, page2) in enumerate(zip(pdf1_data, pdf2_data)):
460
+ page_results = {
461
+ 'page': i + 1,
462
+ 'text_differences': [],
463
+ 'spelling_issues_pdf1': [],
464
+ 'spelling_issues_pdf2': [],
465
+ 'barcodes_pdf1': [],
466
+ 'barcodes_pdf2': [],
467
+ 'color_differences': []
468
+ }
469
+
470
+ # Check spelling for both PDFs
471
+ page_results['spelling_issues_pdf1'] = self.check_spelling(page1['text'])
472
+ page_results['spelling_issues_pdf2'] = self.check_spelling(page2['text'])
473
+
474
+ # Create spelling-only annotated images (one box per error)
475
+ spell_dir = f'static/results/{session_id}'
476
+ os.makedirs(spell_dir, exist_ok=True)
477
+ spell_img1 = page1['image'].copy()
478
+ spell_img2 = page2['image'].copy()
479
+ spell_img1 = self.annotate_spelling_errors_on_image(spell_img1, page_results['spelling_issues_pdf1'])
480
+ spell_img2 = self.annotate_spelling_errors_on_image(spell_img2, page_results['spelling_issues_pdf2'])
481
+ spell_path1 = f'{spell_dir}/page_{i+1}_pdf1_spelling.png'
482
+ spell_path2 = f'{spell_dir}/page_{i+1}_pdf2_spelling.png'
483
+ spell_img1.save(spell_path1)
484
+ spell_img2.save(spell_path2)
485
+
486
+ # Detect barcodes and QR codes
487
+ page_results['barcodes_pdf1'] = self.detect_barcodes_qr_codes(page1['image'])
488
+ page_results['barcodes_pdf2'] = self.detect_barcodes_qr_codes(page2['image'])
489
+
490
+ # Compare colors
491
+ color_diffs = self.compare_colors(page1['image'], page2['image'])
492
+ page_results['color_differences'] = color_diffs
493
+
494
+ # Create annotated images
495
+ if color_diffs:
496
+ output_dir = f'static/results/{session_id}'
497
+ os.makedirs(output_dir, exist_ok=True)
498
+
499
+ annotated_path1 = f'{output_dir}/page_{i+1}_pdf1_annotated.png'
500
+ annotated_path2 = f'{output_dir}/page_{i+1}_pdf2_annotated.png'
501
+
502
+ self.create_annotated_image(page1['image'], color_diffs, annotated_path1)
503
+ self.create_annotated_image(page2['image'], color_diffs, annotated_path2)
504
+
505
+ page_results['annotated_images'] = {
506
+ 'pdf1': f'results/{session_id}/page_{i+1}_pdf1_annotated.png',
507
+ 'pdf2': f'results/{session_id}/page_{i+1}_pdf2_annotated.png',
508
+ 'pdf1_spelling': f'results/{session_id}/page_{i+1}_pdf1_spelling.png',
509
+ 'pdf2_spelling': f'results/{session_id}/page_{i+1}_pdf2_spelling.png'
510
+ }
511
+ else:
512
+ # If no color differences, still save spelling images
513
+ page_results['annotated_images'] = {
514
+ 'pdf1_spelling': f'results/{session_id}/page_{i+1}_pdf1_spelling.png',
515
+ 'pdf2_spelling': f'results/{session_id}/page_{i+1}_pdf2_spelling.png'
516
+ }
517
+
518
+ # Add spelling issues summary to text differences
519
+ if page_results['spelling_issues_pdf1'] or page_results['spelling_issues_pdf2']:
520
+ page_results['text_differences'].append({
521
+ 'type': 'spelling',
522
+ 'pdf1_issues': len(page_results['spelling_issues_pdf1']),
523
+ 'pdf2_issues': len(page_results['spelling_issues_pdf2']),
524
+ 'details': {
525
+ 'pdf1': [issue['word'] for issue in page_results['spelling_issues_pdf1']],
526
+ 'pdf2': [issue['word'] for issue in page_results['spelling_issues_pdf2']]
527
+ }
528
+ })
529
+
530
+ results['text_comparison'].append(page_results)
531
+
532
+ # Aggregate spelling issues
533
+ all_spelling_issues = []
534
+ for page in results['text_comparison']:
535
+ all_spelling_issues.extend(page['spelling_issues_pdf1'])
536
+ all_spelling_issues.extend(page['spelling_issues_pdf2'])
537
+
538
+ results['spelling_issues'] = all_spelling_issues
539
+
540
+ # Aggregate barcodes and QR codes
541
+ all_barcodes = []
542
+ for page in results['text_comparison']:
543
+ all_barcodes.extend(page['barcodes_pdf1'])
544
+ all_barcodes.extend(page['barcodes_pdf2'])
545
+
546
+ results['barcodes_qr_codes'] = all_barcodes
547
+
548
+ return results
549
+
550
+ except Exception as e:
551
+ raise Exception(f"Error comparing PDFs: {str(e)}")
requirements.txt ADDED
@@ -0,0 +1,16 @@
1
+ Flask==2.3.3
2
+ Werkzeug==2.3.7
3
+ PyPDF2==3.0.1
4
+ pdf2image==1.16.3
5
+ Pillow==10.0.1
6
+ opencv-python==4.8.1.78
7
+ pytesseract==0.3.10
8
+ pyzbar==0.1.9
9
+ pyspellchecker==0.7.2
10
+ nltk==3.8.1
11
+ numpy==1.24.3
12
+ scikit-image==0.21.0
13
+ matplotlib==3.7.2
14
+ pandas==2.0.3
15
+ reportlab==4.0.4
16
+ regex==2023.10.3
run.py ADDED
@@ -0,0 +1,123 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Startup script for PDF Comparison Tool
4
+ """
5
+
6
+ import os
7
+ import sys
8
+ import subprocess
9
+ import webbrowser
10
+ import time
11
+ from pathlib import Path
12
+
13
+ def check_python_version():
14
+ """Check if Python version is compatible"""
15
+ if sys.version_info < (3, 7):
16
+ print("❌ Python 3.7 or higher is required")
17
+ print(f"Current version: {sys.version}")
18
+ return False
19
+ print(f"✅ Python {sys.version.split()[0]} is compatible")
20
+ return True
21
+
22
+ def check_dependencies():
23
+ """Check if required dependencies are installed"""
24
+ try:
25
+ import flask
26
+ import cv2
27
+ import numpy
28
+ import PIL
29
+ import pytesseract
30
+ import pdf2image
31
+ import pyzbar
32
+ import spellchecker
33
+ import nltk
34
+ import skimage
35
+ print("✅ All Python dependencies are installed")
36
+ return True
37
+ except ImportError as e:
38
+ print(f"❌ Missing dependency: {e}")
39
+ print("Please run: pip install -r requirements.txt")
40
+ return False
41
+
42
+ def check_tesseract():
43
+ """Check if Tesseract OCR is installed"""
44
+ try:
45
+ import pytesseract
46
+ pytesseract.get_tesseract_version()
47
+ print("✅ Tesseract OCR is available")
48
+ return True
49
+ except Exception as e:
50
+ print(f"❌ Tesseract OCR not found: {e}")
51
+ print("Please install Tesseract:")
52
+ print(" macOS: brew install tesseract")
53
+ print(" Ubuntu: sudo apt-get install tesseract-ocr")
54
+ print(" Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki")
55
+ return False
56
+
57
+ def create_directories():
58
+ """Create necessary directories"""
59
+ directories = ['uploads', 'results', 'static/results']
60
+ for directory in directories:
61
+ Path(directory).mkdir(parents=True, exist_ok=True)
62
+ print("✅ Directories created")
63
+
64
+ def start_application():
65
+ """Start the Flask application"""
66
+ print("\n🚀 Starting PDF Comparison Tool...")
67
+ print("📱 The application will be available at: http://localhost:5000")
68
+ print("⏹️ Press Ctrl+C to stop the application")
69
+ print("-" * 50)
70
+
71
+ try:
72
+ # Start the Flask app
73
+ from app import app
74
+ app.run(debug=True, host='0.0.0.0', port=5000)
75
+ except KeyboardInterrupt:
76
+ print("\n👋 Application stopped by user")
77
+ except Exception as e:
78
+ print(f"❌ Error starting application: {e}")
79
+ return False
80
+
81
+ return True
82
+
83
+ def main():
84
+ """Main startup function"""
85
+ print("=" * 50)
86
+ print("📄 PDF Comparison Tool")
87
+ print("=" * 50)
88
+
89
+ # Check requirements
90
+ if not check_python_version():
91
+ sys.exit(1)
92
+
93
+ if not check_dependencies():
94
+ sys.exit(1)
95
+
96
+ if not check_tesseract():
97
+ sys.exit(1)
98
+
99
+ # Create directories
100
+ create_directories()
101
+
102
+ # Ask user if they want to open browser
103
+ try:
104
+ response = input("\n🌐 Open browser automatically? (y/n): ").lower().strip()
105
+ if response in ['y', 'yes']:
106
+ # Wait a moment for the server to start
107
+ def open_browser():
108
+ time.sleep(2)
109
+ webbrowser.open('http://localhost:5000')
110
+
111
+ import threading
112
+ browser_thread = threading.Thread(target=open_browser)
113
+ browser_thread.daemon = True
114
+ browser_thread.start()
115
+ except KeyboardInterrupt:
116
+ print("\n👋 Setup cancelled by user")
117
+ sys.exit(0)
118
+
119
+ # Start the application
120
+ start_application()
121
+
122
+ if __name__ == "__main__":
123
+ main()
static/css/style.css ADDED
@@ -0,0 +1,228 @@
1
+ /* Custom styles for PDF Comparison Tool */
2
+
3
+ body {
4
+ background-color: #f8f9fa;
5
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
6
+ }
7
+
8
+ .navbar-brand {
9
+ font-weight: 600;
10
+ font-size: 1.5rem;
11
+ }
12
+
13
+ .card {
14
+ border: none;
15
+ border-radius: 12px;
16
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
17
+ transition: transform 0.2s ease-in-out;
18
+ }
19
+
20
+ .card:hover {
21
+ transform: translateY(-2px);
22
+ }
23
+
24
+ .card-header {
25
+ border-radius: 12px 12px 0 0 !important;
26
+ border-bottom: none;
27
+ font-weight: 600;
28
+ }
29
+
30
+ .btn-primary {
31
+ background: linear-gradient(135deg, #007bff, #0056b3);
32
+ border: none;
33
+ border-radius: 8px;
34
+ font-weight: 600;
35
+ padding: 12px 24px;
36
+ transition: all 0.3s ease;
37
+ }
38
+
39
+ .btn-primary:hover {
40
+ background: linear-gradient(135deg, #0056b3, #004085);
41
+ transform: translateY(-1px);
42
+ box-shadow: 0 4px 8px rgba(0, 123, 255, 0.3);
43
+ }
44
+
45
+ .form-control {
46
+ border-radius: 8px;
47
+ border: 2px solid #e9ecef;
48
+ padding: 12px 16px;
49
+ transition: border-color 0.3s ease;
50
+ }
51
+
52
+ .form-control:focus {
53
+ border-color: #007bff;
54
+ box-shadow: 0 0 0 0.2rem rgba(0, 123, 255, 0.25);
55
+ }
56
+
57
+ .nav-tabs .nav-link {
58
+ border: none;
59
+ border-radius: 8px 8px 0 0;
60
+ color: #6c757d;
61
+ font-weight: 500;
62
+ padding: 12px 20px;
63
+ transition: all 0.3s ease;
64
+ }
65
+
66
+ .nav-tabs .nav-link:hover {
67
+ color: #007bff;
68
+ background-color: #f8f9fa;
69
+ }
70
+
71
+ .nav-tabs .nav-link.active {
72
+ background-color: #007bff;
73
+ color: white;
74
+ border: none;
75
+ }
76
+
77
+ .alert {
78
+ border-radius: 8px;
79
+ border: none;
80
+ font-weight: 500;
81
+ }
82
+
83
+ .spinner-border {
84
+ width: 3rem;
85
+ height: 3rem;
86
+ }
87
+
88
+ .progress {
89
+ height: 8px;
90
+ border-radius: 4px;
91
+ }
92
+
93
+ .progress-bar {
94
+ border-radius: 4px;
95
+ }
96
+
97
+ /* Comparison results styling */
98
+ .comparison-image {
99
+ max-width: 100%;
100
+ height: auto;
101
+ border-radius: 8px;
102
+ box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
103
+ margin: 10px 0;
104
+ }
105
+
106
+ .difference-box {
107
+ border: 3px solid #dc3545;
108
+ border-radius: 4px;
109
+ position: relative;
110
+ }
111
+
112
+ .difference-box::after {
113
+ content: "Difference";
114
+ position: absolute;
115
+ top: -10px;
116
+ left: 10px;
117
+ background: #dc3545;
118
+ color: white;
119
+ padding: 2px 8px;
120
+ border-radius: 4px;
121
+ font-size: 12px;
122
+ font-weight: bold;
123
+ }
124
+
125
+ /* Table styling */
126
+ .table {
127
+ border-radius: 8px;
128
+ overflow: hidden;
129
+ box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
130
+ }
131
+
132
+ .table thead th {
133
+ background-color: #f8f9fa;
134
+ border-bottom: 2px solid #dee2e6;
135
+ font-weight: 600;
136
+ color: #495057;
137
+ }
138
+
139
+ .table tbody tr:hover {
140
+ background-color: #f8f9fa;
141
+ }
142
+
143
+ /* Badge styling */
144
+ .badge {
145
+ font-size: 0.8em;
146
+ padding: 6px 10px;
147
+ border-radius: 6px;
148
+ }
149
+
150
+ .badge-danger {
151
+ background-color: #dc3545;
152
+ }
153
+
154
+ .badge-warning {
155
+ background-color: #ffc107;
156
+ color: #212529;
157
+ }
158
+
159
+ .badge-success {
160
+ background-color: #28a745;
161
+ }
162
+
163
+ .badge-info {
164
+ background-color: #17a2b8;
165
+ }
166
+
167
+ /* Responsive design */
168
+ @media (max-width: 768px) {
169
+ .container {
170
+ padding: 0 15px;
171
+ }
172
+
173
+ .card {
174
+ margin-bottom: 20px;
175
+ }
176
+
177
+ .nav-tabs .nav-link {
178
+ padding: 8px 12px;
179
+ font-size: 14px;
180
+ }
181
+
182
+ .btn-lg {
183
+ padding: 10px 20px;
184
+ font-size: 16px;
185
+ }
186
+ }
187
+
188
+ /* Loading animation */
189
+ @keyframes pulse {
190
+ 0% { opacity: 1; }
191
+ 50% { opacity: 0.5; }
192
+ 100% { opacity: 1; }
193
+ }
194
+
195
+ .loading-pulse {
196
+ animation: pulse 1.5s infinite;
197
+ }
198
+
199
+ /* Custom scrollbar */
200
+ ::-webkit-scrollbar {
201
+ width: 8px;
202
+ }
203
+
204
+ ::-webkit-scrollbar-track {
205
+ background: #f1f1f1;
206
+ border-radius: 4px;
207
+ }
208
+
209
+ ::-webkit-scrollbar-thumb {
210
+ background: #c1c1c1;
211
+ border-radius: 4px;
212
+ }
213
+
214
+ ::-webkit-scrollbar-thumb:hover {
215
+ background: #a8a8a8;
216
+ }
217
+
218
+ /* Print styles */
219
+ @media print {
220
+ .navbar, .btn, .nav-tabs {
221
+ display: none !important;
222
+ }
223
+
224
+ .card {
225
+ box-shadow: none !important;
226
+ border: 1px solid #dee2e6 !important;
227
+ }
228
+ }
static/js/script.js ADDED
@@ -0,0 +1,242 @@
1
+ // PDF Comparison Tool JavaScript
2
+
3
+ document.addEventListener('DOMContentLoaded', function() {
4
+ const uploadForm = document.getElementById('uploadForm');
5
+ const loadingSection = document.getElementById('loadingSection');
6
+ const resultsSection = document.getElementById('resultsSection');
7
+ const errorSection = document.getElementById('errorSection');
8
+ const errorMessage = document.getElementById('errorMessage');
9
+
10
+ // Handle form submission
11
+ uploadForm.addEventListener('submit', function(e) {
12
+ e.preventDefault();
13
+
14
+ const formData = new FormData(uploadForm);
15
+ const pdf1 = document.getElementById('pdf1').files[0];
16
+ const pdf2 = document.getElementById('pdf2').files[0];
17
+
18
+ // Validate files
19
+ if (!pdf1 || !pdf2) {
20
+ showError('Please select both PDF files.');
21
+ return;
22
+ }
23
+
24
+ if (!pdf1.name.toLowerCase().endsWith('.pdf') || !pdf2.name.toLowerCase().endsWith('.pdf')) {
25
+ showError('Please select valid PDF files.');
26
+ return;
27
+ }
28
+
29
+ // Show loading
30
+ showLoading();
31
+ hideError();
32
+
33
+ // Submit form via AJAX
34
+ fetch('/upload', {
35
+ method: 'POST',
36
+ body: formData
37
+ })
38
+ .then(response => response.json())
39
+ .then(data => {
40
+ hideLoading();
41
+
42
+ if (data.success) {
43
+ displayResults(data.results);
44
+ } else {
45
+ showError(data.error || 'An error occurred during comparison.');
46
+ }
47
+ })
48
+ .catch(error => {
49
+ hideLoading();
50
+ showError('Network error: ' + error.message);
51
+ });
52
+ });
53
+
54
+ function showLoading() {
55
+ loadingSection.style.display = 'block';
56
+ resultsSection.style.display = 'none';
57
+ errorSection.style.display = 'none';
58
+ }
59
+
60
+ function hideLoading() {
61
+ loadingSection.style.display = 'none';
62
+ }
63
+
64
+ function showError(message) {
65
+ errorMessage.textContent = message;
66
+ errorSection.style.display = 'block';
67
+ resultsSection.style.display = 'none';
68
+ }
69
+
70
+ function hideError() {
71
+ errorSection.style.display = 'none';
72
+ }
73
+
74
+ function displayResults(results) {
75
+ resultsSection.style.display = 'block';
76
+
77
+ // Display visual comparison
78
+ displayVisualComparison(results);
79
+
80
+ // Display spelling issues
81
+ displaySpellingIssues(results);
82
+
83
+ // Display barcodes and QR codes
84
+ displayBarcodes(results);
85
+ }
86
+
87
+ function displayVisualComparison(results) {
88
+ const visualContent = document.getElementById('visualComparisonContent');
89
+ let html = '<div class="row">';
90
+
91
+ if (results.text_comparison && results.text_comparison.length > 0) {
92
+ results.text_comparison.forEach((page, index) => {
93
+ html += `
94
+ <div class="col-12 mb-4">
95
+ <h6 class="text-primary mb-3">Page ${page.page}</h6>
96
+ <div class="row">
97
+ <div class="col-md-6">
98
+ <h6>PDF 1</h6>
99
+ ${page.annotated_images && page.annotated_images.pdf1 ?
100
+ `<img src="/static/${page.annotated_images.pdf1}" class="comparison-image" alt="PDF 1 Page ${page.page}">` :
101
+ '<p class="text-muted">No differences detected</p>'
102
+ }
103
+ </div>
104
+ <div class="col-md-6">
105
+ <h6>PDF 2</h6>
106
+ ${page.annotated_images && page.annotated_images.pdf2 ?
107
+ `<img src="/static/${page.annotated_images.pdf2}" class="comparison-image" alt="PDF 2 Page ${page.page}">` :
108
+ '<p class="text-muted">No differences detected</p>'
109
+ }
110
+ </div>
111
+ </div>
112
+ ${page.color_differences && page.color_differences.length > 0 ?
113
+ `<div class="mt-3">
114
+ <span class="badge badge-danger">${page.color_differences.length} color difference(s) detected</span>
115
+ </div>` :
116
+ '<div class="mt-3"><span class="badge badge-success">No color differences</span></div>'
117
+ }
118
+ </div>
119
+ `;
120
+ });
121
+ } else {
122
+ html += '<div class="col-12"><p class="text-muted">No visual comparison data available.</p></div>';
123
+ }
124
+
125
+ html += '</div>';
126
+ visualContent.innerHTML = html;
127
+ }
128
+
129
+ function displaySpellingIssues(results) {
130
+ const spellingContent = document.getElementById('spellingIssuesContent');
131
+ let html = '';
132
+
133
+ if (results.spelling_issues && results.spelling_issues.length > 0) {
134
+ html += `
135
+ <div class="table-responsive">
136
+ <table class="table table-striped">
137
+ <thead>
138
+ <tr>
139
+ <th>Word</th>
140
+ <th>Language</th>
141
+ <th>English Suggestions</th>
142
+ <th>French Suggestions</th>
143
+ </tr>
144
+ </thead>
145
+ <tbody>
146
+ `;
147
+
148
+ results.spelling_issues.forEach(issue => {
149
+ const englishSuggestions = (issue.suggestions_en || []).join(', ') || 'None';
150
+ const frenchSuggestions = (issue.suggestions_fr || []).join(', ') || 'None';
151
+
152
+ html += `
153
+ <tr>
154
+ <td><strong>${issue.word}</strong></td>
155
+ <td><code>${issue.lang}</code></td>
156
+ <td>${englishSuggestions}</td>
157
+ <td>${frenchSuggestions}</td>
158
+ </tr>
159
+ `;
160
+ });
161
+
162
+ html += `
163
+ </tbody>
164
+ </table>
165
+ </div>
166
+ <div class="mt-3">
167
+ <span class="badge badge-warning">${results.spelling_issues.length} spelling issue(s) found</span>
168
+ </div>
169
+ `;
170
+ } else {
171
+ html = '<div class="alert alert-success"><i class="fas fa-check me-2"></i>No spelling issues detected.</div>';
172
+ }
173
+
174
+ spellingContent.innerHTML = html;
175
+ }
176
+
177
+ function displayBarcodes(results) {
178
+ const barcodesContent = document.getElementById('barcodesContent');
179
+ let html = '';
180
+
181
+ if (results.barcodes_qr_codes && results.barcodes_qr_codes.length > 0) {
182
+ html += `
183
+ <div class="table-responsive">
184
+ <table class="table table-striped">
185
+ <thead>
186
+ <tr>
187
+ <th>Type</th>
188
+ <th>Data</th>
189
+ <th>Position</th>
190
+ </tr>
191
+ </thead>
192
+ <tbody>
193
+ `;
194
+
195
+ results.barcodes_qr_codes.forEach(barcode => {
196
+ const position = `(${barcode.rect.left}, ${barcode.rect.top}) - (${barcode.rect.left + barcode.rect.width}, ${barcode.rect.top + barcode.rect.height})`;
197
+
198
+ html += `
199
+ <tr>
200
+ <td><span class="badge badge-info">${barcode.type}</span></td>
201
+ <td><code>${barcode.data}</code></td>
202
+ <td>${position}</td>
203
+ </tr>
204
+ `;
205
+ });
206
+
207
+ html += `
208
+ </tbody>
209
+ </table>
210
+ </div>
211
+ <div class="mt-3">
212
+ <span class="badge badge-info">${results.barcodes_qr_codes.length} barcode/QR code(s) detected</span>
213
+ </div>
214
+ `;
215
+ } else {
216
+ html = '<div class="alert alert-info"><i class="fas fa-info-circle me-2"></i>No barcodes or QR codes detected.</div>';
217
+ }
218
+
219
+ barcodesContent.innerHTML = html;
220
+ }
221
+
222
+ // Add file input change handlers for better UX
223
+ document.getElementById('pdf1').addEventListener('change', function(e) {
224
+ const file = e.target.files[0];
225
+ if (file) {
226
+ const label = e.target.nextElementSibling;
227
+ if (label && label.classList.contains('form-text')) {
228
+ label.textContent = `Selected: ${file.name}`;
229
+ }
230
+ }
231
+ });
232
+
233
+ document.getElementById('pdf2').addEventListener('change', function(e) {
234
+ const file = e.target.files[0];
235
+ if (file) {
236
+ const label = e.target.nextElementSibling;
237
+ if (label && label.classList.contains('form-text')) {
238
+ label.textContent = `Selected: ${file.name}`;
239
+ }
240
+ }
241
+ });
242
+ });
templates/index.html ADDED
@@ -0,0 +1,142 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>PDF Comparison Tool</title>
+     <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet">
+     <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet">
+     <link href="{{ url_for('static', filename='css/style.css') }}" rel="stylesheet">
+ </head>
+ <body>
+     <div class="container-fluid">
+         <div class="row">
+             <!-- Header -->
+             <div class="col-12">
+                 <nav class="navbar navbar-expand-lg navbar-dark bg-primary">
+                     <div class="container">
+                         <a class="navbar-brand" href="#">
+                             <i class="fas fa-file-pdf me-2"></i>
+                             PDF Comparison Tool
+                         </a>
+                     </div>
+                 </nav>
+             </div>
+         </div>
+
+         <div class="row mt-4">
+             <div class="col-12">
+                 <div class="container">
+                     <!-- Upload Section -->
+                     <div class="card shadow-sm">
+                         <div class="card-header bg-light">
+                             <h5 class="mb-0">
+                                 <i class="fas fa-upload me-2"></i>
+                                 Upload PDF Files for Comparison
+                             </h5>
+                         </div>
+                         <div class="card-body">
+                             <form id="uploadForm" enctype="multipart/form-data">
+                                 <div class="row">
+                                     <div class="col-md-6">
+                                         <div class="mb-3">
+                                             <label for="pdf1" class="form-label">First PDF File</label>
+                                             <input type="file" class="form-control" id="pdf1" name="pdf1" accept=".pdf" required>
+                                             <div class="form-text">Select a PDF file for comparison</div>
+                                         </div>
+                                     </div>
+                                     <div class="col-md-6">
+                                         <div class="mb-3">
+                                             <label for="pdf2" class="form-label">Second PDF File</label>
+                                             <input type="file" class="form-control" id="pdf2" name="pdf2" accept=".pdf" required>
+                                             <div class="form-text">Select a PDF file for comparison</div>
+                                         </div>
+                                     </div>
+                                 </div>
+                                 <div class="d-grid">
+                                     <button type="submit" class="btn btn-primary btn-lg">
+                                         <i class="fas fa-search me-2"></i>
+                                         Compare PDFs
+                                     </button>
+                                 </div>
+                             </form>
+                         </div>
+                     </div>
+
+                     <!-- Loading Section -->
+                     <div id="loadingSection" class="card shadow-sm mt-4" style="display: none;">
+                         <div class="card-body text-center">
+                             <div class="spinner-border text-primary" role="status">
+                                 <span class="visually-hidden">Loading...</span>
+                             </div>
+                             <p class="mt-3">Processing PDFs... This may take a few minutes.</p>
+                             <div class="progress mt-3">
+                                 <div class="progress-bar progress-bar-striped progress-bar-animated" role="progressbar" style="width: 100%"></div>
+                             </div>
+                         </div>
+                     </div>
+
+                     <!-- Results Section -->
+                     <div id="resultsSection" class="mt-4" style="display: none;">
+                         <!-- Comparison Results Tabs -->
+                         <div class="card shadow-sm">
+                             <div class="card-header">
+                                 <ul class="nav nav-tabs card-header-tabs" id="resultsTabs" role="tablist">
+                                     <li class="nav-item" role="presentation">
+                                         <button class="nav-link active" id="visual-tab" data-bs-toggle="tab" data-bs-target="#visual" type="button" role="tab">
+                                             <i class="fas fa-eye me-2"></i>Visual Comparison
+                                         </button>
+                                     </li>
+                                     <li class="nav-item" role="presentation">
+                                         <button class="nav-link" id="spelling-tab" data-bs-toggle="tab" data-bs-target="#spelling" type="button" role="tab">
+                                             <i class="fas fa-spell-check me-2"></i>Spelling Issues
+                                         </button>
+                                     </li>
+                                     <li class="nav-item" role="presentation">
+                                         <button class="nav-link" id="barcodes-tab" data-bs-toggle="tab" data-bs-target="#barcodes" type="button" role="tab">
+                                             <i class="fas fa-barcode me-2"></i>Barcodes & QR Codes
+                                         </button>
+                                     </li>
+                                 </ul>
+                             </div>
+                             <div class="card-body">
+                                 <div class="tab-content" id="resultsTabContent">
+                                     <!-- Visual Comparison Tab -->
+                                     <div class="tab-pane fade show active" id="visual" role="tabpanel">
+                                         <div id="visualComparisonContent">
+                                             <!-- Content will be populated by JavaScript -->
+                                         </div>
+                                     </div>
+
+                                     <!-- Spelling Issues Tab -->
+                                     <div class="tab-pane fade" id="spelling" role="tabpanel">
+                                         <div id="spellingIssuesContent">
+                                             <!-- Content will be populated by JavaScript -->
+                                         </div>
+                                     </div>
+
+                                     <!-- Barcodes Tab -->
+                                     <div class="tab-pane fade" id="barcodes" role="tabpanel">
+                                         <div id="barcodesContent">
+                                             <!-- Content will be populated by JavaScript -->
+                                         </div>
+                                     </div>
+                                 </div>
+                             </div>
+                         </div>
+                     </div>
+
+                     <!-- Error Section -->
+                     <div id="errorSection" class="alert alert-danger mt-4" style="display: none;">
+                         <i class="fas fa-exclamation-triangle me-2"></i>
+                         <span id="errorMessage"></span>
+                     </div>
+                 </div>
+             </div>
+         </div>
+     </div>
+
+     <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/js/bootstrap.bundle.min.js"></script>
+     <script src="{{ url_for('static', filename='js/script.js') }}"></script>
+ </body>
+ </html>
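
For orientation, the template above only declares the containers (`uploadForm`, `loadingSection`, `resultsSection`, the three tab panes, and `errorSection`); the behaviour lives in static/js/script.js from this commit. The sketch below shows roughly how those pieces are expected to fit together. It is a simplified stand-in, not the handler from script.js, and the `/compare` endpoint name is a placeholder for whatever route app.py actually defines.

// Sketch only: simplified version of the submit flow that script.js wires up.
// '/compare' is a placeholder route name; see app.py for the real endpoint.
document.getElementById('uploadForm').addEventListener('submit', async (event) => {
    event.preventDefault();

    const formData = new FormData();
    formData.append('pdf1', document.getElementById('pdf1').files[0]);
    formData.append('pdf2', document.getElementById('pdf2').files[0]);

    document.getElementById('loadingSection').style.display = 'block';
    try {
        const response = await fetch('/compare', { method: 'POST', body: formData });
        const results = await response.json();

        // Populate the tab panes declared above (e.g. displayBarcodes from script.js).
        displayBarcodes(results);
        document.getElementById('resultsSection').style.display = 'block';
    } catch (err) {
        document.getElementById('errorMessage').textContent = err.message;
        document.getElementById('errorSection').style.display = 'block';
    } finally {
        document.getElementById('loadingSection').style.display = 'none';
    }
});
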
test_setup.py ADDED
@@ -0,0 +1,133 @@
+ #!/usr/bin/env python3
+ """
+ Test script to verify PDF Comparison Tool setup
+ """
+
+ import sys
+ import importlib
+
+ def test_imports():
+     """Test if all required packages can be imported"""
+     required_packages = [
+         'flask',
+         'cv2',
+         'numpy',
+         'PIL',
+         'pytesseract',
+         'pdf2image',
+         'pyzbar',
+         'spellchecker',
+         'nltk',
+         'skimage',
+         'matplotlib',
+         'pandas'
+     ]
+
+     print("Testing package imports...")
+     failed_imports = []
+
+     for package in required_packages:
+         try:
+             importlib.import_module(package)
+             print(f"✓ {package}")
+         except ImportError as e:
+             print(f"✗ {package}: {e}")
+             failed_imports.append(package)
+
+     return failed_imports
+
+ def test_tesseract():
+     """Test if Tesseract OCR is available"""
+     print("\nTesting Tesseract OCR...")
+     try:
+         import pytesseract
+         # Try to get Tesseract version
+         version = pytesseract.get_tesseract_version()
+         print(f"✓ Tesseract version: {version}")
+         return True
+     except Exception as e:
+         print(f"✗ Tesseract not found: {e}")
+         print("Please install Tesseract OCR:")
+         print("  macOS: brew install tesseract")
+         print("  Ubuntu: sudo apt-get install tesseract-ocr")
+         print("  Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki")
+         return False
+
+ def test_pdf_comparator():
+     """Test if PDFComparator class can be instantiated"""
+     print("\nTesting PDFComparator...")
+     try:
+         from pdf_comparator import PDFComparator
+         comparator = PDFComparator()
+         print("✓ PDFComparator initialized successfully")
+         return True
+     except Exception as e:
+         print(f"✗ PDFComparator error: {e}")
+         return False
+
+ def test_flask_app():
+     """Test if Flask app can be imported"""
+     print("\nTesting Flask application...")
+     try:
+         from app import app
+         print("✓ Flask app imported successfully")
+         return True
+     except Exception as e:
+         print(f"✗ Flask app error: {e}")
+         return False
+
+ def main():
+     """Run all tests"""
+     print("PDF Comparison Tool - Setup Test")
+     print("=" * 40)
+
+     # Test imports
+     failed_imports = test_imports()
+
+     # Test Tesseract
+     tesseract_ok = test_tesseract()
+
+     # Test PDFComparator
+     comparator_ok = test_pdf_comparator()
+
+     # Test Flask app
+     flask_ok = test_flask_app()
+
+     # Summary
+     print("\n" + "=" * 40)
+     print("SETUP SUMMARY")
+     print("=" * 40)
+
+     if failed_imports:
+         print(f"✗ Missing packages: {', '.join(failed_imports)}")
+         print("Run: pip install -r requirements.txt")
+     else:
+         print("✓ All packages imported successfully")
+
+     if tesseract_ok:
+         print("✓ Tesseract OCR is available")
+     else:
+         print("✗ Tesseract OCR is not available")
+
+     if comparator_ok:
+         print("✓ PDFComparator is working")
+     else:
+         print("✗ PDFComparator has issues")
+
+     if flask_ok:
+         print("✓ Flask application is ready")
+     else:
+         print("✗ Flask application has issues")
+
+     # Overall status
+     all_ok = not failed_imports and tesseract_ok and comparator_ok and flask_ok
+
+     if all_ok:
+         print("\n🎉 Setup is complete! You can run the application with:")
+         print("  python app.py")
+     else:
+         print("\n⚠️ Setup is incomplete. Please fix the issues above.")
+         sys.exit(1)
+
+ if __name__ == "__main__":
+     main()