Concept
VLMs can't reliably recognize fine details when shown a full-screen image.
Like human vision, this tool uses a two-stage "overview first → focus on details" approach to improve accuracy.
Problem: Full Images Miss Details
When you show a VLM (GPT-4V, Claude, etc.) a single 1920×1080 full-screen screenshot:
- Small button text is unreadable
- Fine UI elements go unrecognized
- They understand "where things are" but not "what they say"
Humans work the same way: you can't make out details by gazing at the whole screen; you have to focus on specific areas.
Solution: Overview + Grid Focus
Two-Stage Recognition Approach
- Overview: Full screen image to understand "where things are"
- Grid Selection: Select tile numbers for areas needing detail
- Focused Detail: Capture the 2-3 selected tiles for detailed recognition (see the sketch after this list)
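A minimal sketch of that loop; ask_vlm(), parse_tile_numbers(), capture_overview(), and capture_tiles() are hypothetical placeholders for your VLM client and the capture step, not functions this tool ships:

# Hypothetical pseudocode: none of these helpers ship with the tool
overview = capture_overview()                      # stage 1: full screen
reply = ask_vlm([overview], "Which tiles (1-24) contain the settings panel?")
tiles = parse_tile_numbers(reply)                  # e.g. [8, 9]
crops = capture_tiles(tiles)                       # stage 2: detail crops
labels = ask_vlm(crops, "Read the exact text on each button in these crops.")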
24-Tile Grid
The screen is divided into a 6×4 grid of 24 tiles; specify regions by tile number.
┌──────┬──────┬──────┬──────┬──────┬──────┐
│  1   │  2   │  3   │  4   │  5   │  6   │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  7   │  8   │  9   │  10  │  11  │  12  │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  13  │  14  │  15  │  16  │  17  │  18  │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  19  │  20  │  21  │  22  │  23  │  24  │
└──────┴──────┴──────┴──────┴──────┴──────┘
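Tile numbering is row-major from the top-left, so a tile's pixel rectangle follows directly from its number. A minimal sketch (not the tool's actual implementation), assuming a 1920×1080 screen:

# Map a 1-based tile number to its (left, top, right, bottom) pixel box
def tile_rect(n, width=1920, height=1080, cols=6, rows=4):
    col = (n - 1) % cols           # 0-based column index
    row = (n - 1) // cols          # 0-based row index
    w, h = width // cols, height // rows
    return (col * w, row * h, (col + 1) * w, (row + 1) * h)

tile_rect(8)  # -> (320, 270, 640, 540): second tile of the second row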
Core Tools
1. Screen Capture 24Grid
Divides the screen into 24 tiles and captures only the specified ones.
- Specify by tile number (top-left=1, bottom-right=24)
- Select individual or multiple tiles
- Optional overview image output
# Overview + tiles 8,9 (top-center) for detail
python screen_capture_24grid.py --overview --tiles 8,9
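In essence, --tiles is one full-screen grab followed by per-tile crops. A sketch using Pillow's ImageGrab (macOS/Windows out of the box; Linux needs a supported backend) and the tile_rect helper from the sketch above; the output filenames are illustrative, not the tool's actual ones:

from PIL import ImageGrab

screen = ImageGrab.grab()              # full-screen capture
for n in (8, 9):                       # tiles requested on the CLI
    crop = screen.crop(tile_rect(n, *screen.size))
    crop.save(f"tile_{n:02d}.png")     # one high-resolution crop per tile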
2. Screen Diff Detector
Detects changes between captures and automatically selects the modified tiles.
- Detect differences at 24-tile granularity
- Extract only changed tiles
- Auto-focus on "what changed"
# Detect changes and focus on modified regions
python screen_diff_detector.py --threshold 0.05
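One plausible reading of --threshold (the tool's exact metric may differ): a tile counts as changed when its mean absolute pixel difference, on values normalized to 0-1, exceeds 0.05. A sketch with NumPy and Pillow, assuming two same-sized screenshots:

import numpy as np
from PIL import Image

def changed_tiles(before_path, after_path, threshold=0.05, cols=6, rows=4):
    a = np.asarray(Image.open(before_path), dtype=np.float32) / 255.0
    b = np.asarray(Image.open(after_path), dtype=np.float32) / 255.0
    h, w = a.shape[0] // rows, a.shape[1] // cols
    changed = []
    for n in range(1, cols * rows + 1):
        r, c = (n - 1) // cols, (n - 1) % cols
        tile_a = a[r*h:(r+1)*h, c*w:(c+1)*w]
        tile_b = b[r*h:(r+1)*h, c*w:(c+1)*w]
        if np.abs(tile_a - tile_b).mean() > threshold:
            changed.append(n)
    return changed  # e.g. [8, 9] if only the top-center region changed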
3. OCR Text Locator
Locates text on screen via OCR and identifies the tile that contains it.
- Detect text via OCR
- Get screen coordinates
- Auto-identify "which tile has the Submit button"
# Find which tile contains "Submit" button
python ocr_text_locator.py --text "Submit"
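A sketch of the tile lookup built on pytesseract's word-level bounding boxes (image_to_data returns pixel coordinates per word); the screenshot filename is illustrative:

import pytesseract
from PIL import Image

img = Image.open("screen.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data["text"]):
    if word.strip() == "Submit":
        cx = data["left"][i] + data["width"][i] // 2   # box center x
        cy = data["top"][i] + data["height"][i] // 2   # box center y
        col = cx * 6 // img.width    # 0-based column on the 6x4 grid
        row = cy * 4 // img.height   # 0-based row
        print("tile", row * 6 + col + 1)  # 1-based tile number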
Design Philosophy: Like Human Vision
Why 24 Tiles?
- Human visual cognition: Only a limited area is perceivable at once
- Eye-tracking research: Only 10-20% of a page receives attention
- F-pattern: Gaze sweeps top-left → right, then down the left side
VLMs work the same. Narrowing the focus area improves accuracy over showing everything.
CLI/Automation-First Design
- Number-based: "Show tiles 8,9" in one command
- Diff detection: "Show only what changed" automatically
- Pipeline-ready: Combines with other CLI tools
Installation
Requirements
- Python 3.8+
- macOS / Linux / Windows
Dependencies
pip install pillow numpy opencv-python pytesseract
Tesseract OCR (for OCR features)
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr