Screen recognition system that reduced GPT-4V image tokens by 90%
| Method | Image Size | Tokens (GPT-4V) | Monthly Cost (100x/day) |
|---|---|---|---|
| Full Screen (1920×1080) | 2.07MB | ~1,700 tokens | $500+ |
| 24-Tile Division (1 tile) | ~90KB | ~170 tokens | $50 |
| After Diff Detection (avg 3 tiles) | ~270KB | ~510 tokens | $150 |
Reduction Rate: Approx. 70-90% (varies by use case)
Purpose: Divide screen into 24 tiles and send only necessary ones
Key Features:
# Get only tiles 7, 8, 13, 14 (center 4 tiles)
python screen_capture_24grid.py --tiles 7,8,13,14
Purpose: Detect changes and send only modified tiles
Key Features:
# Detect changes and send only modified regions
python screen_diff_detector.py --threshold 0.05
Purpose: Get text coordinates and identify corresponding tiles
Key Features:
# Detect "Submit" button coordinates and identify tile
python ocr_text_locator.py --text "Submit"
1. Hash Comparison per Tile
Hash pixel data of each tile and detect differences from previous version at high speed.
2. Change Threshold Adjustment
# Send only tiles with 5%+ change
--threshold 0.05
3. Change Pattern Optimization
System that automates PC operations by showing the screen to Claude (Vision Language Model). Unlike traditional RPA (image recognition-based), VLM "understands" screen content for operations.
This toolkit was developed as part of the SPQR (Semi-autonomous Prototyping and Quality Refinement) system. Through quality-first mass production methodology, we cycled through implementation → testing → refinement at high speed, reaching production-ready status in 3 weeks.
# 1. Check entire screen with 24-tile division
python screen_capture_24grid.py --tiles all
# 2. Claude: "Form is in screen center (tiles 13, 14)"
# 3. Send only relevant tiles
python screen_capture_24grid.py --tiles 13,14
# 4. Claude: "Please enter name field" → Execute input
# 5. Confirm changes with diff detection
python screen_diff_detector.py --threshold 0.05
Token Reduction: 1,700 tokens → 340 tokens (80% reduction)
# 1. Get entire screen initially
python screen_capture_24grid.py --tiles all
# 2. Detect only diffs afterwards
while true; do
python screen_diff_detector.py --threshold 0.05
sleep 5
done
Effect: Zero token consumption when no changes detected
pip install pillow numpy opencv-python pytesseract
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
Download the latest version from GitHub repository and reduce VLM automation token costs by 90%.
View on GitHub Download