This bot analyzes uploaded PDF files to determine whether they contain Epstein-related content. Relevant files are processed and indexed, borderline files are queued for manual review, and irrelevant files are moved to trash.
When a PDF is uploaded to /data/uploads/, it moves through the following pipeline:
Upload → Extract Text → Analyze Content → Score Relevance → Route Document
relevance_score = (
    keyword_matches * 0.3 +
    entity_matches * 0.3 +
    context_score * 0.4
)
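The weighted sum above can be sketched as a small function. This is only an illustration: it assumes each component has already been normalized to the 0-100 range, which the formula itself does not state, and the function name is hypothetical.

```python
def relevance_score(keyword_matches: float,
                    entity_matches: float,
                    context_score: float) -> float:
    """Combine three component scores with the weights 0.3 / 0.3 / 0.4.

    Assumption (not stated in the source): every component is already
    normalized to the 0-100 range, so the result is also 0-100.
    """
    components = {
        "keyword_matches": keyword_matches,
        "entity_matches": entity_matches,
        "context_score": context_score,
    }
    for name, value in components.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return keyword_matches * 0.3 + entity_matches * 0.3 + context_score * 0.4


# Example: strong keyword and entity matches with good context.
print(round(relevance_score(90, 80, 85), 1))  # prints 85.0
```

Because the weights sum to 1.0, the combined score stays on the same scale as its inputs, which is what lets it be compared directly against the accept and review thresholds.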
Located in config/keywords.yml:
high_priority:
  - "Jeffrey Epstein"
  - "Ghislaine Maxwell"
  - "Little St. James"
  - "flight logs"
  - # ... more terms
medium_priority:
  - "Virgin Islands"
  - "Palm Beach"
  - # ... more terms
entities:
  people:
    - # Names from Character Directory
  places:
    - # Locations from Glossary
  organizations:
    - # Organizations
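One way keyword lists like these might be applied to extracted text is sketched below. The `KEYWORDS` dictionary is a stand-in for the parsed YAML (only the terms shown above), and `find_keywords` is a hypothetical helper, not the bot's actual matcher. Placeholder items such as `- # ... more terms` would typically load as null entries, so the sketch skips them.

```python
import re

# Stand-in for the parsed config/keywords.yml; None mimics the empty
# placeholder entries ("- # ... more terms").
KEYWORDS = {
    "high_priority": ["Jeffrey Epstein", "Ghislaine Maxwell",
                      "Little St. James", "flight logs", None],
    "medium_priority": ["Virgin Islands", "Palm Beach", None],
}


def find_keywords(text: str, terms: list) -> list:
    """Return the terms found in text, case-insensitively, in config order."""
    found = []
    for term in terms:
        if not term:  # skip null or empty placeholder entries
            continue
        # Require non-word characters (or string edges) around the term so
        # that "flight logs" does not match inside "preflight logsheet".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


sample = "The flight logs mention Palm Beach and jeffrey epstein."
print(find_keywords(sample, KEYWORDS["high_priority"]))
print(find_keywords(sample, KEYWORDS["medium_priority"]))
```

Matching on whole terms rather than raw substrings is a design choice made here to reduce false positives; the real bot may weigh this trade-off differently.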
Located in config/pdf-analysis.yml:
thresholds:
  accept: 70 # Auto-index if score >= 70
  review: 40 # Manual review if 40 <= score < 70
  reject: 40 # Auto-trash if score < 40
processing:
  ocr_enabled: true
  max_file_size: 100MB
  timeout: 300 # seconds
ai_services:
  azure_openai_endpoint: ${AZURE_OPENAI_ENDPOINT}
  azure_document_intelligence: ${AZURE_DOC_INTEL_ENDPOINT}
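The threshold comments above imply a three-way decision. A minimal sketch of that rule follows, assuming the `accept` and `review` values are passed in after the config is loaded; the function name `classify` is hypothetical.

```python
def classify(score: float, accept: float = 70, review: float = 40) -> str:
    """Map a relevance score to ACCEPT, REVIEW or REJECT.

    Mirrors the config comments: score >= accept is auto-indexed,
    review <= score < accept goes to manual review, anything lower
    is auto-trashed.
    """
    if review > accept:
        raise ValueError("review threshold must not exceed accept threshold")
    if score >= accept:
        return "ACCEPT"
    if score >= review:
        return "REVIEW"
    return "REJECT"


for value in (95, 55, 5):
    print(value, classify(value))
```

Only two numbers are needed to define three bands, so the separate `reject` value in the config is redundant with `review` as long as both stay at 40.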
Upload PDFs to the /data/uploads/ folder to trigger analysis automatically, or run the bot from the command line:
# Analyze single file
python bots/pdf-analysis-bot/analyze.py --file path/to/document.pdf
# Analyze directory
python bots/pdf-analysis-bot/analyze.py --dir path/to/pdfs/
# With custom threshold
python bots/pdf-analysis-bot/analyze.py --file doc.pdf --threshold 60
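The flags used in these commands could be parsed along the following lines. This is a sketch rather than the contents of `analyze.py`: treating `--file` and `--dir` as mutually exclusive and defaulting `--threshold` to 70 are assumptions, and `--debug` is taken from the test command later in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="analyze.py",
        description="Score PDFs for relevance and decide how to route them.",
    )
    # Assumption: exactly one of --file / --dir must be given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--file", help="path to a single PDF")
    target.add_argument("--dir", help="directory containing PDFs")
    parser.add_argument("--threshold", type=int, default=70,
                        help="minimum relevance score (0-100) to accept")
    parser.add_argument("--debug", action="store_true",
                        help="enable verbose output")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--file", "doc.pdf", "--threshold", "60"])
print(args.file, args.threshold, args.debug)
```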
{
  "file": "document.pdf",
  "timestamp": "2024-12-20T12:00:00Z",
  "analysis": {
    "relevance_score": 85,
    "confidence": 0.92,
    "decision": "ACCEPT",
    "routing": "index"
  },
  "content": {
    "page_count": 45,
    "text_extracted": true,
    "language": "en",
    "has_images": true
  },
  "matches": {
    "keywords": ["Jeffrey Epstein", "flight logs", "Little St. James"],
    "entities": {
      "people": ["Jane Doe", "John Smith"],
      "places": ["Virgin Islands", "Palm Beach"],
      "dates": ["2005-03-15", "2019-07-06"]
    }
  },
  "metadata": {
    "title": "Flight Manifests 2005",
    "author": "Unknown",
    "created": "2005-03-20",
    "modified": "2005-03-20"
  }
}
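A downstream script could consume a report of this shape as sketched below. The `REPORT` string is a trimmed copy of the example above, and `summarize` is a hypothetical helper that assumes the field names shown there.

```python
import json

# Trimmed copy of the example report; field names follow the sample above.
REPORT = """
{
  "file": "document.pdf",
  "analysis": {"relevance_score": 85, "confidence": 0.92,
               "decision": "ACCEPT", "routing": "index"},
  "matches": {
    "keywords": ["Jeffrey Epstein", "flight logs", "Little St. James"],
    "entities": {
      "people": ["Jane Doe", "John Smith"],
      "places": ["Virgin Islands", "Palm Beach"],
      "dates": ["2005-03-15", "2019-07-06"]
    }
  }
}
"""


def summarize(report_json: str) -> str:
    """Produce a one-line summary of an analysis report."""
    report = json.loads(report_json)
    analysis = report["analysis"]
    matches = report.get("matches", {})
    keyword_count = len(matches.get("keywords", []))
    # Count entities across all categories (people, places, dates, ...).
    entity_count = sum(len(values)
                       for values in matches.get("entities", {}).values())
    return (
        f"{report['file']}: {analysis['decision']} "
        f"(score {analysis['relevance_score']}, "
        f"{keyword_count} keywords, {entity_count} entities)"
    )


print(summarize(REPORT))
# prints: document.pdf: ACCEPT (score 85, 3 keywords, 6 entities)
```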
File: .github/workflows/pdf-analysis.yml
name: PDF Analysis
on:
  push:
    paths:
      - 'data/uploads/**.pdf'
  workflow_dispatch:
jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r bots/pdf-analysis-bot/requirements.txt
      - name: Analyze PDFs
        env:
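          # Secret names below are assumed to match the variable names
          AZURE_OPENAI_KEY: ${{ secrets.AZURE_OPENAI_KEY }}
          AZURE_DOC_INTEL_KEY: ${{ secrets.AZURE_DOC_INTEL_KEY }}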
        run: |
          python bots/pdf-analysis-bot/analyze.py --dir data/uploads/
      - name: Process Results
        run: |
          python bots/pdf-analysis-bot/route.py
      - name: Commit Changes
        run: |
          git config user.name "PDF Analysis Bot"
          git config user.email "bot@github.com"
          git add .
          git commit -m "Process uploaded PDFs" || exit 0
          git push
azure-ai-documentintelligence>=1.0.0
azure-ai-textanalytics>=5.3.0
openai>=1.0.0
pypdf>=3.9.0
pdf2image>=1.16.0
pytesseract>=0.3.10
pillow>=10.0.0
pyyaml>=6.0
python-magic>=0.4.27
logs/pdf-analysis/
logs/errors/
logs/access/
OCR Fails:
Low Relevance Scores:
Upload Errors:
# Unit tests
pytest bots/pdf-analysis-bot/tests/
# Integration tests
pytest bots/pdf-analysis-bot/tests/integration/
# Test with sample file
python bots/pdf-analysis-bot/analyze.py --file tests/fixtures/sample.pdf --debug
See CONTRIBUTING.md for guidelines.
def analyze_pdf(file_path: str, threshold: int = 70) -> AnalysisResult:
    """
    Analyze PDF for Epstein-related content.

    Args:
        file_path: Path to PDF file
        threshold: Minimum relevance score (0-100)

    Returns:
        AnalysisResult object with scores and routing decision
    """

def route_document(analysis: AnalysisResult) -> str:
    """
    Route document based on analysis.

    Args:
        analysis: AnalysisResult object

    Returns:
        Destination path (index, review, or trash)
    """
File: flight_log_2005.pdf
Score: 95
Decision: ACCEPT → Indexed
Reason: Contains multiple high-priority keywords and entities
File: random_recipe.pdf
Score: 5
Decision: REJECT → Trash
Reason: No relevant keywords or entities found
File: business_document.pdf
Score: 55
Decision: REVIEW → Manual Review Queue
Reason: Some matches but context unclear
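The three examples above pair each decision with a destination, which is what `route_document` is documented to return. A minimal sketch of that mapping follows. The fields of `AnalysisResult` are assumptions, since the source names the type but never defines it, and the destination strings are taken from the docstring (index, review, or trash).

```python
from dataclasses import dataclass


@dataclass
class AnalysisResult:
    """Hypothetical shape; the real class is not shown in this document."""
    file: str
    relevance_score: float
    decision: str  # "ACCEPT", "REVIEW" or "REJECT"


# Destination for each decision, following the examples above.
ROUTES = {"ACCEPT": "index", "REVIEW": "review", "REJECT": "trash"}


def route_document(analysis: AnalysisResult) -> str:
    """Return the destination for an analyzed document."""
    try:
        return ROUTES[analysis.decision]
    except KeyError:
        raise ValueError(f"unknown decision: {analysis.decision!r}") from None


for result in (
    AnalysisResult("flight_log_2005.pdf", 95, "ACCEPT"),
    AnalysisResult("random_recipe.pdf", 5, "REJECT"),
    AnalysisResult("business_document.pdf", 55, "REVIEW"),
):
    print(result.file, "->", route_document(result))
```

Raising on an unknown decision, rather than defaulting to trash, keeps a malformed result from silently discarding a document.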
This bot is continuously improved based on feedback and performance metrics.
Last Updated: December 2024