Hub_of_Epstein_Files_Directory

Public Files Integration Guide

Overview

This guide explains how to integrate publicly available Epstein-related files from official government sources into the Hub.

Available Sources

1. FBI Vault (22 PDF Files)

URL: https://vault.fbi.gov/jeffrey-epstein

Files:

Access: Publicly available, no authentication required

2. DOJ Flight Logs (Text Files)

URL: Various DOJ and court document repositories

Content:

Access: Available through FOIA releases and court filings

3. Court Filings

Sources:

Content:


Integration Tools

Tool 1: Fetch Public Files

Script: scripts/fetch-public-files.py

Features:

Usage:

# Interactive mode
python scripts/fetch-public-files.py

# Non-interactive (for automation)
python scripts/fetch-public-files.py --non-interactive --source fbi_vault
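
The two invocations above imply a small command-line interface. The following is a minimal sketch of how such flags could be parsed with argparse, not the script's actual code. Only `--non-interactive` and `--source` come from this guide; the `build_parser` name, the list of allowed source keys, and the help strings are assumptions made for illustration.

```python
import argparse

# Assumption: 'fbi_vault' and 'court_docs' are the source keys mentioned in this guide;
# the real script may accept others.
KNOWN_SOURCES = ["fbi_vault", "court_docs"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Fetch public files (illustrative sketch)")
    parser.add_argument(
        "--non-interactive",
        action="store_true",
        help="skip prompts so the script can run in automation",
    )
    parser.add_argument(
        "--source",
        choices=KNOWN_SOURCES,
        default=None,
        help="source key to fetch; when omitted, an interactive run would prompt for it",
    )
    return parser
```

With this layout, `build_parser().parse_args(["--non-interactive", "--source", "fbi_vault"])` yields `args.non_interactive` set to true and `args.source` set to `"fbi_vault"`.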

Tool 2: Process PDFs

Script: scripts/process-pdfs.py

Features:

Usage:

# Process all PDFs in data/public_files
python scripts/process-pdfs.py

# Process specific directory
python scripts/process-pdfs.py --input data/public_files/fbi_vault

Tool 3: Update Search Index

Script: scripts/generate-search-index.py

Features:

Usage:

# Update search index with new documents
python scripts/generate-search-index.py --include-processed

Step-by-Step Integration

Step 1: Install Dependencies

# Install Python dependencies
pip install requests pypdf pytesseract pillow pdf2image

# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install tesseract-ocr poppler-utils

# For macOS
brew install tesseract poppler

Step 2: Fetch FBI Vault Files

# Run the fetch script
python scripts/fetch-public-files.py

# Follow prompts:
# - Select "FBI Vault" source
# - Confirm download (y/n)
# - Wait for download to complete

# Files will be saved to:
# data/public_files/fbi_vault/

Expected output:

📁 Fetching FBI Vault Files...
[1/22] FBI Vault Part 01
  Downloading: https://vault.fbi.gov/...
  ✅ Downloaded: 1,234.56 KB

[2/22] FBI Vault Part 02
  ...

✅ FBI Vault: 22 files processed
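
A download loop that produces progress lines like these could look roughly like the sketch below. It is not the code in scripts/fetch-public-files.py: the function names, the one-second pause between requests, and the 60-second timeout are all assumptions. The fetch function is passed in as a parameter so the loop can be exercised without network access.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_with_requests(url: str) -> bytes:
    """Fetch one URL with the requests library (installed in Step 1)."""
    import requests  # imported lazily so the loop itself has no third-party dependency

    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    return response.content


def download_all(
    urls: Iterable[str],
    dest_dir: Path,
    fetch: Callable[[str], bytes] = fetch_with_requests,
    delay_seconds: float = 1.0,
) -> List[Path]:
    """Save each URL into dest_dir, pausing between requests to stay polite."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    url_list = list(urls)
    saved = []
    for index, url in enumerate(url_list, start=1):
        name = url.rstrip("/").rsplit("/", 1)[-1]  # last path segment as file name
        print(f"[{index}/{len(url_list)}] {name}")
        print(f"  Downloading: {url}")
        data = fetch(url)
        target = dest_dir / name
        target.write_bytes(data)
        print(f"  Downloaded: {len(data) / 1024:,.2f} KB")
        saved.append(target)
        if index < len(url_list):
            time.sleep(delay_seconds)  # no pause needed after the final file
    return saved
```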

Step 3: Process PDFs

# Extract text and metadata
python scripts/process-pdfs.py

# This will:
# - Extract text from each PDF
# - Perform OCR if needed
# - Extract dates, locations, case numbers
# - Generate metadata files
# - Create search-ready index
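
The date and case-number extraction listed above can be approximated with regular expressions. This is a hedged sketch rather than the logic in scripts/process-pdfs.py: the patterns, the function names, and the choice to normalise US-style dates to ISO format are assumptions. The case-number pattern is modelled only on the sample values shown in the metadata example in this guide.

```python
import re
from datetime import date
from typing import List

# Assumed patterns; real documents will need broader coverage.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
CASE_NUMBER = re.compile(r"\b[A-Z]{2,4}-\d{4}-\d{3,5}\b")


def extract_dates(text: str) -> List[str]:
    """Return sorted, unique, valid dates in YYYY-MM-DD form."""
    found = set()
    for match in ISO_DATE.finditer(text):
        try:
            date.fromisoformat(match.group(0))  # rejects impossible dates
        except ValueError:
            continue
        found.add(match.group(0))
    for match in US_DATE.finditer(text):
        month, day, year = (int(part) for part in match.groups())
        try:
            found.add(date(year, month, day).isoformat())
        except ValueError:
            continue  # e.g. month 13 or day 45
    return sorted(found)


def extract_case_numbers(text: str) -> List[str]:
    """Return sorted, unique identifiers shaped like CV-2019-001."""
    return sorted(set(CASE_NUMBER.findall(text)))
```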

Output structure:

data/processed/
├── text/
│   ├── epstein-part-01.txt
│   ├── epstein-part-02.txt
│   └── ...
├── metadata/
│   ├── epstein-part-01.json
│   ├── epstein-part-02.json
│   └── ...
├── indexed/
│   ├── epstein-part-01.json
│   └── ...
└── processing_summary.json

Step 4: Update Search Index

# Regenerate search index with new documents
python scripts/generate-search-index.py --include-processed

# This updates:
# web/js/search-index.js
# web/js/search-stats.json
# web/js/search-metadata.json

Step 5: Commit Changes

# Add processed files and the regenerated search index files
git add data/processed/
git add web/js/search-index.js web/js/search-stats.json web/js/search-metadata.json

# Commit
git commit -m "Add FBI Vault documents (22 files processed)"

# Push to GitHub
git push

Automated Integration

GitHub Actions Workflow

The repository includes an automated workflow that can fetch and process files monthly.

Workflow: .github/workflows/fetch-public-files.yml

Features:

Manual trigger:

  1. Go to the Actions tab in GitHub
  2. Select Fetch and Process Public Files
  3. Click Run workflow to open the options menu
  4. Choose options:
    • ☑️ Fetch FBI Vault files
    • ☑️ Fetch DOJ flight logs
  5. Click the Run workflow button to start the run

File Size Considerations

GitHub Repository Limits

FBI Vault Files

Solutions for Large Files

Option 1: Git LFS (Free Tier)

# Install Git LFS
git lfs install

# Track PDF files
git lfs track "*.pdf"
git lfs track "data/public_files/**"

# Add and commit
git add .gitattributes
git commit -m "Configure Git LFS for PDFs"

Free tier limits:

Option 2: External Storage

Store raw PDFs externally and commit only the processed text:

# Don't commit raw PDFs (** also matches subdirectories such as fbi_vault/)
echo "data/public_files/**/*.pdf" >> .gitignore

# Only commit processed text
git add data/processed/text/
git add data/processed/metadata/

Option 3: Release Assets

Upload large files as GitHub Release assets:

# Create release
gh release create v1.0 \
  --title "Public Files Archive" \
  --notes "FBI Vault documents"

# Upload files
gh release upload v1.0 data/public_files/fbi_vault/*.pdf
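
Before choosing among these options, it helps to know which downloaded files are actually too large for a regular commit. The sketch below walks a directory and reports files over a threshold. The helper name is hypothetical; the 100 MB default reflects GitHub's per-file limit for files not stored in Git LFS.

```python
from pathlib import Path
from typing import List, Tuple

GITHUB_HARD_LIMIT_MB = 100  # GitHub rejects regular (non-LFS) files larger than this


def find_large_files(
    root: Path, limit_mb: float = GITHUB_HARD_LIMIT_MB
) -> List[Tuple[Path, float]]:
    """Return (path, size in MB) for every file under root larger than limit_mb."""
    limit_bytes = limit_mb * 1024 * 1024
    results = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                results.append((path, size / (1024 * 1024)))
    return results
```

Calling `find_large_files(Path("data/public_files"))` after a fetch gives the list of files that would need Git LFS, external storage, or a release asset.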

Data Structure

Metadata Format

Each processed document includes:

{
  "file_name": "epstein-part-01.pdf",
  "processed_date": "2024-12-20T08:00:00Z",
  "word_count": 15234,
  "char_count": 89456,
  "page_count": 75,
  "extraction_method": "text_extraction",
  "dates_found": [
    "1999-12-15",
    "2006-03-22",
    "2008-07-14"
  ],
  "case_numbers": [
    "CV-2019-001",
    "INV-2019-9878"
  ],
  "locations": [
    "Little St. James",
    "Palm Beach",
    "Manhattan"
  ]
}
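
A record of this shape could be assembled as in the sketch below. The field names come from the example above; everything else is an assumption, including the function name, counting words by splitting on whitespace, and the `"ocr"` value for OCR-derived text (the guide only shows `"text_extraction"`).

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable


def build_metadata(
    file_name: str,
    text: str,
    page_count: int,
    used_ocr: bool,
    dates: Iterable[str],
    case_numbers: Iterable[str],
    locations: Iterable[str],
) -> Dict[str, object]:
    """Assemble one metadata record with the fields shown in the example."""
    return {
        "file_name": file_name,
        "processed_date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "word_count": len(text.split()),  # assumption: whitespace-separated tokens
        "char_count": len(text),
        "page_count": page_count,
        "extraction_method": "ocr" if used_ocr else "text_extraction",
        "dates_found": sorted(set(dates)),
        "case_numbers": sorted(set(case_numbers)),
        "locations": sorted(set(locations)),
    }


def to_json(record: Dict[str, object]) -> str:
    """Serialise a record with the two-space indentation used in the example."""
    return json.dumps(record, indent=2)
```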

Search Index Format

{
  "id": "fbi-vault-01",
  "title": "FBI Vault Part 01",
  "content": "First 1000 characters of text...",
  "full_text_path": "data/processed/text/epstein-part-01.txt",
  "date": "2019-08-10",
  "source": "FBI Vault",
  "type": "Investigation Record",
  "relevance": 95,
  "metadata": { /* ... */ }
}
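
An index entry like this one could be derived from the extracted text and its metadata. The sketch below is illustrative only: the function name and the whitespace collapsing in the preview are assumptions. The `relevance` field is left out because this guide does not say how it is computed.

```python
from typing import Dict


def build_index_entry(
    doc_id: str,
    title: str,
    text: str,
    full_text_path: str,
    date: str,
    source: str,
    doc_type: str,
    metadata: Dict[str, object],
    preview_chars: int = 1000,
) -> Dict[str, object]:
    """Build a search index entry whose content is a short preview of the text."""
    preview = " ".join(text.split())[:preview_chars]  # collapse whitespace, then truncate
    return {
        "id": doc_id,
        "title": title,
        "content": preview,
        "full_text_path": full_text_path,
        "date": date,
        "source": source,
        "type": doc_type,
        "metadata": metadata,
    }
```

Keeping only a preview in the index and pointing to the full text by path keeps the generated search-index.js small enough to load in a browser.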

Adding New Sources

Example: Adding Court Documents

  1. Update fetch-public-files.py:
PUBLIC_SOURCES['court_docs'] = {
    'name': 'Court Filings - SDNY',
    'base_url': 'https://example.com/docs',
    'files': [
        'case-123-exhibit-a.pdf',
        'case-123-exhibit-b.pdf'
    ]
}
  2. Add fetch method:
def fetch_court_documents(self):
    """Fetch court documents"""
    source = PUBLIC_SOURCES['court_docs']
    # Implementation similar to FBI Vault
  3. Update workflow:
- name: Fetch court documents
  run: |
    python scripts/fetch-public-files.py --source court_docs
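
One way to fill in the fetch method is to first turn a source definition into a list of URL and destination pairs, then hand that list to whatever download routine the script uses. The sketch below assumes each source's files are saved under data/public_files/ in a folder named after the source key, as the FBI Vault files are. The `plan_downloads` helper is hypothetical, and the example.com URL is the placeholder from the steps above.

```python
from pathlib import Path
from typing import List, Tuple
from urllib.parse import urljoin

# Same placeholder definition as in step 1.
PUBLIC_SOURCES = {
    "court_docs": {
        "name": "Court Filings - SDNY",
        "base_url": "https://example.com/docs",
        "files": ["case-123-exhibit-a.pdf", "case-123-exhibit-b.pdf"],
    }
}


def plan_downloads(
    source_key: str, output_root: Path = Path("data/public_files")
) -> List[Tuple[str, Path]]:
    """Return (url, destination path) pairs for every file of one source."""
    source = PUBLIC_SOURCES[source_key]
    base = source["base_url"].rstrip("/") + "/"  # trailing slash so urljoin appends
    return [
        (urljoin(base, name), output_root / source_key / name)
        for name in source["files"]
    ]
```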

Best Practices

1. Rate Limiting

2. Verification

3. Documentation

4. Updates

5. Privacy


Troubleshooting

Issue: Download Fails

Solution:

Issue: OCR Not Working

Solution:

# Install Tesseract
sudo apt-get install tesseract-ocr

# Verify installation
tesseract --version

# Test that pytesseract can locate the Tesseract binary
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
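
The same checks can be run from Python, which is useful inside an automated job. The sketch below uses only the standard library to report whether the `tesseract` binary is on PATH and whether the `pytesseract` package is importable. The function name and the returned keys are assumptions.

```python
import importlib.util
import shutil
from typing import Dict


def check_ocr_setup(binary: str = "tesseract") -> Dict[str, object]:
    """Report whether the OCR binary and the Python wrapper are both available."""
    return {
        # Full path to the executable, or None when it is not on PATH
        "binary_path": shutil.which(binary),
        # True when the pytesseract package can be imported
        "python_package": importlib.util.find_spec("pytesseract") is not None,
    }
```

If `binary_path` is `None`, the system package is missing or not on PATH; if `python_package` is false, rerun the pip install from Step 1.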

Issue: Out of Memory

Solution:

Issue: Files Too Large for GitHub

Solution:


Maintenance

Monthly Tasks

Quarterly Tasks


Resources

Official Sources:

Tools:

Documentation:


Summary

✅ Tools provided:

✅ Sources supported:

✅ Features:

Next steps:

  1. Install dependencies
  2. Run python scripts/fetch-public-files.py
  3. Process with python scripts/process-pdfs.py
  4. Update search index
  5. Deploy to GitHub Pages

Cost: $0 - All tools use free tier services!