This guide explains how to integrate publicly available Epstein-related files from official government sources into the Hub.
URL: https://vault.fbi.gov/jeffrey-epstein
Files:
Access: Publicly available, no authentication required
URL: Various DOJ and court document repositories
Content:
Access: Available through FOIA releases and court filings
Sources:
Content:
Script: scripts/fetch-public-files.py
Features:
Usage:
# Interactive mode
python scripts/fetch-public-files.py
# Non-interactive (for automation)
python scripts/fetch-public-files.py --non-interactive --source fbi_vault
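Internally, the fetch step reduces to a download loop over a `PUBLIC_SOURCES` mapping (the same structure used when adding a new source later in this guide). The sketch below is illustrative only: the file list, helper names, and defaults are assumptions, not the script's actual internals.

```python
import os
import requests

# Minimal sketch of the fetch logic. The PUBLIC_SOURCES structure mirrors
# the one shown in the "adding a new source" section; the real file list
# holds the 22 FBI Vault part names.
PUBLIC_SOURCES = {
    'fbi_vault': {
        'name': 'FBI Vault',
        'base_url': 'https://vault.fbi.gov/jeffrey-epstein',
        'files': [],  # populated with the part file names
    },
}

def build_url(source_key, file_name):
    """Join a source's base URL with one of its file names."""
    return f"{PUBLIC_SOURCES[source_key]['base_url']}/{file_name}"

def fetch_source(source_key, out_dir='data/public_files'):
    """Download every file for one source into out_dir/<source_key>/."""
    dest = os.path.join(out_dir, source_key)
    os.makedirs(dest, exist_ok=True)
    for file_name in PUBLIC_SOURCES[source_key]['files']:
        resp = requests.get(build_url(source_key, file_name), timeout=60)
        resp.raise_for_status()
        with open(os.path.join(dest, file_name), 'wb') as fh:
            fh.write(resp.content)
        print(f"Downloaded: {len(resp.content) / 1024:,.2f} KB")
```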
Script: scripts/process-pdfs.py
Features:
Usage:
# Process all PDFs in data/public_files
python scripts/process-pdfs.py
# Process specific directory
python scripts/process-pdfs.py --input data/public_files/fbi_vault
Script: scripts/generate-search-index.py
Features:
Usage:
# Update search index with new documents
python scripts/generate-search-index.py --include-processed
# Install Python dependencies
pip install requests pypdf pytesseract pillow pdf2image
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install tesseract-ocr poppler-utils
# For macOS
brew install tesseract poppler
# Run the fetch script
python scripts/fetch-public-files.py
# Follow prompts:
# - Select "FBI Vault" source
# - Confirm download (y/n)
# - Wait for download to complete
# Files will be saved to:
# data/public_files/fbi_vault/
Expected output:
Fetching FBI Vault Files...

[1/22] FBI Vault Part 01
  Downloading: https://vault.fbi.gov/...
  ✓ Downloaded: 1,234.56 KB

[2/22] FBI Vault Part 02
...

✓ FBI Vault: 22 files processed
# Extract text and metadata
python scripts/process-pdfs.py
# This will:
# - Extract text from each PDF
# - Perform OCR if needed
# - Extract dates, locations, case numbers
# - Generate metadata files
# - Create search-ready index
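The extraction steps above can be sketched as a single helper: try native PDF text extraction first, then fall back to OCR when a document yields little or no text. This assumes the pypdf, pytesseract, and pdf2image dependencies installed earlier; the actual logic in `process-pdfs.py` may differ.

```python
def extract_text(pdf_path, min_chars=100):
    """Extract text from a PDF, falling back to OCR for scanned pages."""
    from pypdf import PdfReader  # imported lazily so OCR deps stay optional
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if len(text.strip()) < min_chars:  # likely a scanned document
        import pytesseract
        from pdf2image import convert_from_path
        pages = convert_from_path(pdf_path)  # needs poppler-utils
        text = "\n".join(pytesseract.image_to_string(p) for p in pages)
    return text
```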
Output structure:
data/processed/
├── text/
│   ├── epstein-part-01.txt
│   ├── epstein-part-02.txt
│   └── ...
├── metadata/
│   ├── epstein-part-01.json
│   ├── epstein-part-02.json
│   └── ...
├── indexed/
│   ├── epstein-part-01.json
│   └── ...
└── processing_summary.json
# Regenerate search index with new documents
python scripts/generate-search-index.py
# This updates:
# web/js/search-index.js
# web/js/search-stats.json
# web/js/search-metadata.json
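One plausible way to emit `web/js/search-index.js` is to serialize the index entries as JSON assigned to a global constant that the front-end can load with a plain `<script>` tag. The helper below is a hypothetical sketch, not the generator's actual code.

```python
import json

def write_search_index(entries, path='web/js/search-index.js'):
    """Write index entries as a JS file defining a SEARCH_INDEX constant."""
    payload = json.dumps(entries, indent=2)
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write(f"const SEARCH_INDEX = {payload};\n")
```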
# Add processed files
git add data/processed/
git add web/js/search-index.js
# Commit
git commit -m "Add FBI Vault documents (22 files processed)"
# Push to GitHub
git push
The repository includes an automated workflow that can fetch and process files monthly.
Workflow: .github/workflows/fetch-public-files.yml
Features:
Manual trigger:
# Install Git LFS
git lfs install
# Track PDF files
git lfs track "*.pdf"
git lfs track "data/public_files/**"
# Add and commit
git add .gitattributes
git commit -m "Configure Git LFS for PDFs"
Free tier limits:
Store raw PDFs externally, keep only processed text:
# Don't commit raw PDFs
echo "data/public_files/*.pdf" >> .gitignore
# Only commit processed text
git add data/processed/text/
git add data/processed/metadata/
Upload large files as GitHub Release assets:
# Create release
gh release create v1.0 \
  --title "Public Files Archive" \
  --notes "FBI Vault documents"
# Upload files
gh release upload v1.0 data/public_files/fbi_vault/*.pdf
Each processed document includes:
{
  "file_name": "epstein-part-01.pdf",
  "processed_date": "2024-12-20T08:00:00Z",
  "word_count": 15234,
  "char_count": 89456,
  "page_count": 75,
  "extraction_method": "text_extraction",
  "dates_found": [
    "1999-12-15",
    "2006-03-22",
    "2008-07-14"
  ],
  "case_numbers": [
    "CV-2019-001",
    "INV-2019-9878"
  ],
  "locations": [
    "Little St. James",
    "Palm Beach",
    "Manhattan"
  ]
}
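Fields like `dates_found` and `case_numbers` can be recovered from the extracted text with simple regular expressions. The patterns below are illustrative; the ones in `process-pdfs.py` may be stricter.

```python
import re

# Illustrative patterns: ISO-style dates and CV-/INV- case numbers
# matching the metadata example above.
DATE_RE = re.compile(r'\b\d{4}-\d{2}-\d{2}\b')
CASE_RE = re.compile(r'\b(?:CV|INV)-\d{4}-\d{3,4}\b')

def extract_metadata(text):
    """Collect unique dates and case numbers found in extracted text."""
    return {
        'dates_found': sorted(set(DATE_RE.findall(text))),
        'case_numbers': sorted(set(CASE_RE.findall(text))),
    }

sample = "Filed 1999-12-15 under CV-2019-001; see also INV-2019-9878."
print(extract_metadata(sample))
# → {'dates_found': ['1999-12-15'], 'case_numbers': ['CV-2019-001', 'INV-2019-9878']}
```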
{
  "id": "fbi-vault-01",
  "title": "FBI Vault Part 01",
  "content": "First 1000 characters of text...",
  "full_text_path": "data/processed/text/epstein-part-01.txt",
  "date": "2019-08-10",
  "source": "FBI Vault",
  "type": "Investigation Record",
  "relevance": 95,
  "metadata": { /* ... */ }
}
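Assembling an entry like this from a processed document is mostly field mapping. The helper below is hypothetical; the field names follow the example above.

```python
def build_index_entry(doc_id, title, text, metadata, source='FBI Vault'):
    """Map a processed document onto a search-index entry."""
    dates = metadata.get('dates_found') or []
    return {
        'id': doc_id,
        'title': title,
        'content': text[:1000],            # first 1000 characters only
        'date': dates[0] if dates else None,
        'source': source,
        'metadata': metadata,
    }
```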
In scripts/fetch-public-files.py, add an entry to PUBLIC_SOURCES:

PUBLIC_SOURCES['court_docs'] = {
    'name': 'Court Filings - SDNY',
    'base_url': 'https://example.com/docs',
    'files': [
        'case-123-exhibit-a.pdf',
        'case-123-exhibit-b.pdf'
    ]
}
def fetch_court_documents(self):
    """Fetch court documents"""
    source = PUBLIC_SOURCES['court_docs']
    # Implementation similar to FBI Vault
- name: Fetch court documents
  run: |
    python scripts/fetch-public-files.py --source court_docs
Solution:
Solution:
# Install Tesseract
sudo apt-get install tesseract-ocr
# Verify installation
tesseract --version
# Test OCR
python -c "import pytesseract; print('OCR ready')"
Solution:
Solution:
Official Sources:
Tools:
Documentation:
✓ Tools provided:
✓ Sources supported:
✓ Features:
Next steps:
1. Run python scripts/fetch-public-files.py
2. Run python scripts/process-pdfs.py

Cost: $0 - all tools use free tier services!