VLM Token Optimization Toolkit

View on GitHub Download

💰 Token Cost Reduction Results

Method	Image File Size	Tokens (GPT-4V)	Monthly Cost (100 calls/day)
Full Screen (1920×1080)	2.07MB	~1,700 tokens	$500+
24-Tile Division (1 tile)	~90KB	~170 tokens	$50
After Diff Detection (avg 3 tiles)	~270KB	~510 tokens	$150

Reduction Rate: Approx. 70-90% (varies by use case)
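
The 70-90% range follows directly from the approximate token figures in the table. A minimal sketch of that arithmetic, assuming the table's rounded values of about 1,700 tokens for the full screen and about 170 tokens per tile (the constant and function names are illustrative, not part of the toolkit):

```python
FULL_SCREEN_TOKENS = 1700  # approximate full-screen figure from the table
TOKENS_PER_TILE = 170      # approximate per-tile figure from the table


def reduction_rate(tiles_sent: int) -> float:
    """Fraction of tokens saved compared with sending the full screen."""
    return 1 - (tiles_sent * TOKENS_PER_TILE) / FULL_SCREEN_TOKENS


# One tile and the three-tile average quoted in the table.
for tiles in (1, 3):
    print(f"{tiles} tile(s): {reduction_rate(tiles):.0%} reduction")
```

Sending one tile saves about 90% and sending three saves about 70%, which matches the two ends of the quoted range.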

What This Toolkit Solves

Dramatic image token cost reduction: Send only the necessary regions instead of the full screen
Improved VLM response speed: Smaller images take less time to process
Practical VLM-based automation: Lower costs make production-level automated operation feasible

🚀 Core Tools

1. Screen Capture 24Grid

Purpose: Divide screen into 24 tiles and send only necessary ones

Key Features:

Divide screen into 6×4 (24 tiles)
Specify by tile number (top-left=1, bottom-right=24)
Select individual or multiple tiles

# Get only tiles 9, 10, 15, 16 (the center 4 tiles, assuming tiles are numbered row by row from the top-left)
python screen_capture_24grid.py --tiles 9,10,15,16
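
To show how a tile number could map to a pixel region, here is a minimal sketch. It assumes row-major numbering (tile 1 top-left, tile 6 top-right, tile 24 bottom-right) and a screen size that divides evenly into the grid. The function name `tile_box` is hypothetical and may not match the script's internals.

```python
from typing import Tuple

COLS, ROWS = 6, 4  # the 6x4 grid described above


def tile_box(tile: int, width: int = 1920, height: int = 1080) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels for a 1-based tile number.

    Assumes row-major numbering; this is an assumption, not confirmed
    behaviour of screen_capture_24grid.py.
    """
    if not 1 <= tile <= COLS * ROWS:
        raise ValueError(f"tile must be between 1 and {COLS * ROWS}")
    row, col = divmod(tile - 1, COLS)
    tile_w, tile_h = width // COLS, height // ROWS
    left, top = col * tile_w, row * tile_h
    return (left, top, left + tile_w, top + tile_h)
```

For a 1920×1080 screen, `tile_box(9)` gives `(640, 270, 960, 540)`. The tuple has the same layout as the box argument of Pillow's `Image.crop`, so it can be passed to it directly.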

2. Screen Diff Detector

Purpose: Detect changes and send only modified tiles

Key Features:

Detect differences from previous screenshot at 24-tile granularity
Extract only changed tiles
Adjustable diff threshold

# Detect changes and send only modified regions
python screen_diff_detector.py --threshold 0.05

3. OCR Text Locator

Purpose: Get text coordinates and identify corresponding tiles

Key Features:

Detect text via OCR
Get text coordinates on screen
Reverse-calculate corresponding tiles from coordinates

# Detect "Submit" button coordinates and identify tile
python ocr_text_locator.py --text "Submit"
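
The reverse calculation from coordinates to a tile can be sketched as below. This is an illustration under assumptions, not the script's actual code: it assumes row-major tile numbering, uses the center of each OCR word box, and the names `tile_for_point` and `tiles_for_text` are hypothetical. The OCR step uses `pytesseract.image_to_data`, which needs the Tesseract binary installed.

```python
COLS, ROWS = 6, 4  # the 6x4 grid described above


def tile_for_point(x: int, y: int, width: int = 1920, height: int = 1080) -> int:
    """Map a screen coordinate to a 1-based tile number (row-major, assumed)."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("point is outside the screen")
    col = x * COLS // width
    row = y * ROWS // height
    return row * COLS + col + 1


def tiles_for_text(image, target: str) -> set:
    """Return the tiles containing OCR words equal to `target` (case-insensitive).

    `image` is a Pillow image. pytesseract is imported lazily so that
    tile_for_point can be used without Tesseract installed.
    """
    import pytesseract

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    width, height = image.size
    tiles = set()
    for word, left, top, w, h in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if word.strip().lower() == target.lower():
            # Use the center of the word's bounding box to pick a single tile.
            tiles.add(tile_for_point(left + w // 2, top + h // 2, width, height))
    return tiles
```

A word that straddles a tile boundary is assigned only to the tile holding its center. A stricter version could return every tile the bounding box overlaps.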

🧠 Design Philosophy: Why 24 Tiles?

Cognitive Science Basis

Human visual cognition: People can attend to only a limited area of the screen at a time
Eye-tracking research: Only 10-20% of the total screen area receives attention during web/app interaction
F-pattern: The eye typically moves from the top-left across to the right, then down the left side

Technical Optimization

6×4 Grid: 1 tile = 320×270px (for 1920×1080)
Claude Desktop Integration: Compatible with existing screenshot features
Flexibility: Specify from single tile to all 24 tiles

Diff Detection Algorithm

1. Hash Comparison per Tile

Hash the pixel data of each tile and compare it with the hash from the previous screenshot to detect changed tiles quickly.
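
A minimal sketch of per-tile hashing, using only the standard library. It assumes an 8-bit grayscale frame stored row by row (for example, what Pillow's `Image.convert("L").tobytes()` returns) and row-major tile numbering. The choice of SHA-256 and the function names are illustrative assumptions, not the toolkit's confirmed implementation.

```python
import hashlib
from typing import List

COLS, ROWS = 6, 4  # the 6x4 grid described above


def tile_hashes(pixels: bytes, width: int, height: int) -> List[str]:
    """Hash each tile of a grayscale frame; digests are in row-major tile order."""
    if len(pixels) != width * height:
        raise ValueError("pixel buffer does not match width * height")
    tile_w, tile_h = width // COLS, height // ROWS
    view = memoryview(pixels)
    digests = []
    for row in range(ROWS):
        for col in range(COLS):
            digest = hashlib.sha256()
            # Feed the tile to the hash one scanline segment at a time.
            for y in range(row * tile_h, (row + 1) * tile_h):
                start = y * width + col * tile_w
                digest.update(view[start:start + tile_w])
            digests.append(digest.hexdigest())
    return digests


def changed_tiles(previous: List[str], current: List[str]) -> List[int]:
    """Return 1-based numbers of tiles whose digest differs between frames."""
    return [i + 1 for i, (a, b) in enumerate(zip(previous, current)) if a != b]
```

A hash mismatch only says that a tile changed, not by how much, so a comparison like this works as a cheap first pass before any threshold check.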

2. Change Threshold Adjustment

# Send only tiles with 5%+ change
--threshold 0.05
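
One plausible reading of the threshold is the fraction of pixels in a tile that differ from the previous frame. The sketch below follows that reading, which is an assumption: the script may measure change differently. The names `changed_fraction` and `should_send` are hypothetical, and the tiles are assumed to be equal-length 8-bit grayscale byte strings.

```python
def changed_fraction(prev_tile: bytes, curr_tile: bytes, pixel_tolerance: int = 0) -> float:
    """Fraction of pixels whose value differs by more than `pixel_tolerance`."""
    if not prev_tile or len(prev_tile) != len(curr_tile):
        raise ValueError("tiles must be non-empty and the same size")
    changed = sum(1 for a, b in zip(prev_tile, curr_tile) if abs(a - b) > pixel_tolerance)
    return changed / len(prev_tile)


def should_send(prev_tile: bytes, curr_tile: bytes, threshold: float = 0.05) -> bool:
    """True when at least `threshold` of the tile's pixels changed (5% by default)."""
    return changed_fraction(prev_tile, curr_tile) >= threshold
```

Raising `pixel_tolerance` is one way to ignore compression noise or subtle anti-aliasing shifts. For full-size tiles, a NumPy version of the same comparison would be considerably faster than this pure-Python loop.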

3. Change Pattern Optimization

Static UI: Buttons, menus (no transmission needed)
Dynamic Content: Text input fields, scroll regions (send only these)
Animations: Ignore minor changes (adjust with threshold)

📖 VLA Project Background

What is VLA?

VLA is a system that automates PC operations by showing the screen to Claude, a vision language model (VLM). Unlike traditional RPA, which relies on image recognition, the VLM "understands" the screen content before acting on it.

Challenge: Token Cost Explosion

Sending a Full HD screen (1920×1080) on every call costs about 1,700 tokens per call
100 operations/day = 170,000 tokens per day (over $500/month)
Cost reduction is a prerequisite for production deployment

Solution: 24-Tile Division + Diff Detection

Phase 1: Divide the screen into 24 tiles and send only the necessary ones (up to 90% reduction when a single tile is enough)
Phase 2: Send only changed regions via diff detection (further reduction)
Phase 3: Send only regions around specific OCR text (pinpoint accuracy)

Implementation Results

Token reduction rate: 70-90%
Monthly cost: $500 → $50-150
VLM response speed: 30-50% improvement (reduced image processing time)

Integration with Mass Production System

This toolkit was developed as part of the SPQR (Semi-autonomous Prototyping and Quality Refinement) system. Using a quality-first mass-production methodology, we cycled rapidly through implementation → testing → refinement and reached production-ready status in 3 weeks.

🛠️ Use Cases (Practical)

Case 1: Web Form Auto-Fill

# 1. Check entire screen with 24-tile division
python screen_capture_24grid.py --tiles all

# 2. Claude: "Form is in tiles 13 and 14"

# 3. Send only relevant tiles
python screen_capture_24grid.py --tiles 13,14

# 4. Claude: "Please enter name field" → Execute input

# 5. Confirm changes with diff detection
python screen_diff_detector.py --threshold 0.05

Token Reduction: 1,700 tokens → 340 tokens per follow-up capture (80% reduction, after the initial full-screen check)

Case 2: Dynamic Content Monitoring

# 1. Get entire screen initially
python screen_capture_24grid.py --tiles all

# 2. Detect only diffs afterwards
while true; do
    python screen_diff_detector.py --threshold 0.05
    sleep 5
done

Effect: Zero token consumption when no changes detected
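
The same polling pattern can be sketched in Python to make the "send only on change" behaviour explicit. The callables `detect_changes` and `send_tiles` are hypothetical stand-ins for the diff detector and the VLM upload step; neither is an API provided by this toolkit.

```python
import time
from typing import Callable, List


def monitor(
    detect_changes: Callable[[], List[int]],
    send_tiles: Callable[[List[int]], None],
    interval: float = 5.0,
    max_polls: int = 0,
) -> int:
    """Poll for changed tiles and send them only when something changed.

    Returns the number of sends. max_polls=0 means poll until interrupted.
    """
    sends = 0
    polls = 0
    while max_polls == 0 or polls < max_polls:
        tiles = detect_changes()
        if tiles:  # nothing is sent, so no tokens are spent, when the list is empty
            send_tiles(tiles)
            sends += 1
        polls += 1
        if max_polls == 0 or polls < max_polls:
            time.sleep(interval)
    return sends
```

Passing the two steps in as callables keeps the loop independent of how capture and upload are implemented, and `max_polls` makes it easy to run a bounded number of iterations when testing.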

📦 Installation

Requirements

Python 3.8+
macOS (for Claude Desktop integration)
Linux/Windows (standalone use)

Dependencies

pip install pillow numpy opencv-python pytesseract

Tesseract OCR Installation (for OCR features)

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

🚀 Get Started Now

Download the latest version from the GitHub repository and reduce VLM automation token costs by up to 90%.
