View on GitHub Download

💰 Token Cost Reduction Results

| Method | Image Size | Tokens (GPT-4V) | Monthly Cost (100x/day) |
|---|---|---|---|
| Full Screen (1920×1080) | 2.07MB | ~1,700 tokens | $500+ |
| 24-Tile Division (1 tile) | ~90KB | ~170 tokens | $50 |
| After Diff Detection (avg 3 tiles) | ~270KB | ~510 tokens | $150 |

Reduction Rate: Approx. 70-90% (varies by use case)
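
The 70-90% range follows directly from the token figures in the table. The short calculation below is only a sketch of that arithmetic; the function name `reduction_rate` is made up for illustration and is not part of the toolkit.

```python
def reduction_rate(baseline_tokens: int, reduced_tokens: int) -> float:
    """Return the fraction of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - reduced_tokens / baseline_tokens


# Approximate token counts taken from the results table.
FULL_SCREEN = 1700   # full 1920x1080 capture
ONE_TILE = 170       # a single tile
THREE_TILES = 510    # average of 3 tiles after diff detection

print(f"1 tile:  {reduction_rate(FULL_SCREEN, ONE_TILE):.0%} saved")
print(f"3 tiles: {reduction_rate(FULL_SCREEN, THREE_TILES):.0%} saved")
```

Sending one tile corresponds to the upper end of the range and the three-tile average to the lower end.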

What This Toolkit Solves

🚀 Core Tools

1. Screen Capture 24Grid

Purpose: Divide screen into 24 tiles and send only necessary ones

Key Features:

# Get only tiles 7, 8, 13, 14 (center 4 tiles)
python screen_capture_24grid.py --tiles 7,8,13,14
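
To make the tile selection concrete, here is a minimal sketch of how a tile number could be mapped to a pixel rectangle. It assumes a grid of 6 columns × 4 rows with 1-based, row-major numbering; the actual layout and numbering used by `screen_capture_24grid.py` are not stated here and may differ. `tile_box` and `parse_tiles` are hypothetical helper names.

```python
from typing import List, Tuple

# Assumed layout: 6 columns x 4 rows = 24 tiles (not confirmed by the toolkit).
COLS, ROWS = 6, 4

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_box(tile: int, width: int, height: int) -> Box:
    """Return the pixel rectangle of a 1-based, row-major tile number."""
    if not 1 <= tile <= COLS * ROWS:
        raise ValueError(f"tile must be between 1 and {COLS * ROWS}")
    row, col = divmod(tile - 1, COLS)
    left = col * width // COLS
    top = row * height // ROWS
    right = (col + 1) * width // COLS
    bottom = (row + 1) * height // ROWS
    return (left, top, right, bottom)


def parse_tiles(spec: str) -> List[int]:
    """Parse a --tiles style value such as "7,8,13,14" or "all"."""
    if spec.strip().lower() == "all":
        return list(range(1, COLS * ROWS + 1))
    return [int(part) for part in spec.split(",")]


for tile in parse_tiles("7,8,13,14"):
    print(tile, tile_box(tile, 1920, 1080))
```

Each tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so a captured screenshot could be cropped per tile with it.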

2. Screen Diff Detector

Purpose: Detect changes and send only modified tiles

Key Features:

# Detect changes and send only modified regions
python screen_diff_detector.py --threshold 0.05

3. OCR Text Locator

Purpose: Get text coordinates and identify corresponding tiles

Key Features:

# Detect "Submit" button coordinates and identify tile
python ocr_text_locator.py --text "Submit"
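
The step from text coordinates to tiles can be sketched as follows. The example assumes the OCR stage has already produced a bounding box as left, top, width and height in pixels (pytesseract's `image_to_data` reports boxes with those fields), and it assumes a 6 × 4 grid with 1-based, row-major tile numbers. The grid layout and the helper name `tiles_for_box` are assumptions, not the toolkit's confirmed behaviour.

```python
from typing import List

# Assumed layout: 6 columns x 4 rows, tiles numbered 1..24 row by row.
COLS, ROWS = 6, 4


def tiles_for_box(left: int, top: int, width: int, height: int,
                  screen_w: int, screen_h: int) -> List[int]:
    """Return the 1-based tile numbers that a pixel bounding box overlaps."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    def clamp(value: int, upper: int) -> int:
        return max(0, min(value, upper))

    # Column/row of the first and last pixel covered by the box.
    first_col = clamp(left * COLS // screen_w, COLS - 1)
    last_col = clamp((left + width - 1) * COLS // screen_w, COLS - 1)
    first_row = clamp(top * ROWS // screen_h, ROWS - 1)
    last_row = clamp((top + height - 1) * ROWS // screen_h, ROWS - 1)

    return [row * COLS + col + 1
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


# Hypothetical box for a "Submit" button on a 1920x1080 screen.
print(tiles_for_box(900, 500, 120, 40, 1920, 1080))
```

A box that straddles a tile boundary yields every tile it touches, so the text is never cut off in the image that gets sent.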

🧠 Design Philosophy: Why 24 Tiles?

Cognitive Science Basis

Technical Optimization

Diff Detection Algorithm

1. Hash Comparison per Tile

Hash the pixel data of each tile and compare it with the hash from the previous capture to detect changed tiles quickly.
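
A minimal sketch of per-tile hash comparison, assuming each tile is available as raw pixel bytes (for example from Pillow's `Image.tobytes()`). SHA-256 is chosen here only for illustration; the hash function actually used by the toolkit is not specified, and `hash_tiles` and `changed_tiles` are hypothetical names.

```python
import hashlib
from typing import Dict, List


def hash_tiles(tiles: Dict[int, bytes]) -> Dict[int, str]:
    """Map each tile number to a digest of its raw pixel bytes."""
    return {number: hashlib.sha256(data).hexdigest()
            for number, data in tiles.items()}


def changed_tiles(previous: Dict[int, str], current: Dict[int, str]) -> List[int]:
    """Return tile numbers whose digest is new or differs from the previous capture."""
    return sorted(number for number, digest in current.items()
                  if previous.get(number) != digest)


# Two fake captures of three tiles; only tile 2 differs.
before = hash_tiles({1: b"\x00" * 16, 2: b"\x00" * 16, 3: b"\xff" * 16})
after = hash_tiles({1: b"\x00" * 16, 2: b"\x01" * 16, 3: b"\xff" * 16})
print(changed_tiles(before, after))
```

An exact hash flags a tile even if a single pixel changes, so it works best as a cheap first pass before any finer comparison.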

2. Change Threshold Adjustment

# Send only tiles with 5%+ change
--threshold 0.05
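
One plausible reading of the threshold is "fraction of pixels in a tile that changed". The sketch below implements that reading on grayscale pixel bytes in pure Python; the real tool may measure change differently, and `changed_fraction`, `should_send` and the `tolerance` parameter are assumptions introduced for illustration.

```python
def changed_fraction(prev: bytes, curr: bytes, tolerance: int = 0) -> float:
    """Return the fraction of pixels whose grayscale value differs by more than tolerance."""
    if len(prev) != len(curr):
        raise ValueError("tiles must have the same number of pixels")
    if not prev:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > tolerance)
    return changed / len(prev)


def should_send(prev: bytes, curr: bytes, threshold: float = 0.05) -> bool:
    """Decide whether a tile changed enough (threshold or more) to be sent."""
    return changed_fraction(prev, curr) >= threshold


# A 100-pixel tile in which 5 pixels changed reaches the 5% threshold.
old_tile = bytes(100)
new_tile = bytes([255] * 5 + [0] * 95)
print(changed_fraction(old_tile, new_tile), should_send(old_tile, new_tile))
```

A non-zero `tolerance` ignores small brightness jitter, which can help with anti-aliased text or video noise. For full-size tiles the same comparison is usually vectorised with NumPy rather than looped in Python.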

3. Change Pattern Optimization

📖 VLA Project Background

What is VLA?

VLA is a system that automates PC operations by showing the screen to Claude, a vision language model (VLM). Unlike traditional RPA, which relies on image recognition, a VLM "understands" the screen content it operates on.

Challenge: Token Cost Explosion

Solution: 24-Tile Division + Diff Detection

  1. Phase 1: Divide screen into 24 tiles, send only necessary ones (90% reduction)
  2. Phase 2: Send only changed regions via diff detection (further reduction)
  3. Phase 3: Send only regions around specific OCR text (pinpoint accuracy)

Implementation Results

Integration with Mass Production System

This toolkit was developed as part of the SPQR (Semi-autonomous Prototyping and Quality Refinement) system. Following a quality-first mass production methodology, we cycled rapidly through implementation → testing → refinement and reached production-ready status in 3 weeks.

🛠️ Use Cases (Practical)

Case 1: Web Form Auto-Fill

# 1. Check entire screen with 24-tile division
python screen_capture_24grid.py --tiles all

# 2. Claude: "Form is in screen center (tiles 13, 14)"

# 3. Send only relevant tiles
python screen_capture_24grid.py --tiles 13,14

# 4. Claude: "Please enter name field" → Execute input

# 5. Confirm changes with diff detection
python screen_diff_detector.py --threshold 0.05

Token Reduction: 1,700 tokens → 340 tokens (80% reduction)

Case 2: Dynamic Content Monitoring

# 1. Get entire screen initially
python screen_capture_24grid.py --tiles all

# 2. Detect only diffs afterwards
while true; do
    python screen_diff_detector.py --threshold 0.05
    sleep 5
done

Effect: Zero token consumption when no changes detected

📦 Installation

Requirements

Dependencies

pip install pillow numpy opencv-python pytesseract

Tesseract OCR Installation (for OCR features)

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

🚀 Get Started Now

Download the latest version from the GitHub repository and reduce VLM automation token costs by up to 90%.
