Concept

VLMs can't recognize fine details when shown a full-screen image.
Mirroring human vision, this tool takes a two-stage approach, "overview first → focus on details," to improve accuracy.

Problem: Full Images Miss Details

When you show a VLM (GPT-4V, Claude, etc.) a single 1920×1080 full-screen capture:

  • Small button text is unreadable
  • Fine UI elements go unrecognized
  • They understand "where things are" but not "what they say"

Humans work the same way: you can't pick out details by gazing at the whole screen; you have to focus on specific areas.

Solution: Overview + Grid Focus

Two-Stage Recognition Approach

  1. Overview: full-screen image to understand "where things are"
  2. Grid Selection: the VLM picks tile numbers for the areas needing detail
  3. Focused Detail: the 2-3 selected tiles are recaptured for detailed recognition

Stages 1 and 3 are the two recognition passes; grid selection bridges them. A hypothetical driver for this flow is sketched below.
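
As a sketch only: the loop below drives the flow via the CLI. ask_vlm() is a hypothetical stand-in for whatever VLM client you use, and the output filenames (overview.png, tile_N.png) are assumptions about what the capture script writes, not documented behavior.

# Hypothetical two-stage driver (ask_vlm and the filenames are assumptions)
import subprocess

def two_stage_query(question, ask_vlm):
    # Stage 1 (overview): capture the full screen with the grid overlay
    subprocess.run(["python", "screen_capture_24grid.py", "--overview"], check=True)
    tiles = ask_vlm("overview.png",
                    f"Which tiles (1-24) should I zoom into to answer: {question}")
    # Stage 2 (focused detail): capture only the tiles the VLM selected
    subprocess.run(["python", "screen_capture_24grid.py", "--tiles", tiles], check=True)
    return ask_vlm([f"tile_{t}.png" for t in tiles.split(",")], question)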

24-Tile Grid

The screen is divided into 6×4 = 24 tiles (6 columns × 4 rows). Specify regions by tile number.

┌──────┬──────┬──────┬──────┬──────┬──────┐
│  1   │  2   │  3   │  4   │  5   │  6   │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  7   │  8   │  9   │ 10   │ 11   │ 12   │
├──────┼──────┼──────┼──────┼──────┼──────┤
│ 13   │ 14   │ 15   │ 16   │ 17   │ 18   │
├──────┼──────┼──────┼──────┼──────┼──────┤
│ 19   │ 20   │ 21   │ 22   │ 23   │ 24   │
└──────┴──────┴──────┴──────┴──────┴──────┘
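
Tile numbering maps to pixel rectangles with simple integer math. A minimal sketch that restates the grid layout above (the actual script's internals aren't shown here):

# Tile number -> pixel box for a 6x4 grid (tile 1 = top-left,
# numbering runs left-to-right, top-to-bottom, as in the diagram)
COLS, ROWS = 6, 4

def tile_box(tile, width, height):
    """Return (left, top, right, bottom) pixel bounds for a 1-based tile number."""
    if not 1 <= tile <= COLS * ROWS:
        raise ValueError(f"tile must be 1-{COLS * ROWS}, got {tile}")
    col = (tile - 1) % COLS
    row = (tile - 1) // COLS
    w, h = width // COLS, height // ROWS
    return (col * w, row * h, (col + 1) * w, (row + 1) * h)

print(tile_box(8, 1920, 1080))  # -> (320, 270, 640, 540)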

Core Tools

1. Screen Capture 24Grid

Divide the screen into 24 tiles and capture only the specified tiles.

  • Specify by tile number (top-left=1, bottom-right=24)
  • Select individual or multiple tiles
  • Optional overview image output
# Overview + tiles 8,9 (top-center) for detail
python screen_capture_24grid.py --overview --tiles 8,9
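
Under the hood, the capture step can be as simple as one screen grab plus crops. A sketch using Pillow's ImageGrab (an assumption; the script's actual implementation isn't shown, and ImageGrab needs extra setup on some Linux systems), reusing tile_box() from the grid section:

# Grab the full screen once, then crop and save each requested tile
from PIL import ImageGrab

def capture_tiles(tiles):
    screen = ImageGrab.grab()          # full-screen screenshot
    w, h = screen.size
    for t in tiles:
        screen.crop(tile_box(t, w, h)).save(f"tile_{t:02d}.png")

capture_tiles([8, 9])  # same regions as --tiles 8,9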

2. Screen Diff Detector

Detect changes between consecutive captures and auto-select the modified tiles.

  • Detect differences at 24-tile granularity
  • Extract only changed tiles
  • Auto-focus on "what changed"
# Detect changes and focus on modified regions
python screen_diff_detector.py --threshold 0.05
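
A sketch of what per-tile diff detection can look like, assuming grayscale comparison and a mean absolute pixel difference normalized to 0-1 (the real script may use a different metric), again reusing tile_box():

# Return tile numbers whose normalized mean abs-diff exceeds the threshold
import numpy as np
from PIL import Image

def changed_tiles(before_path, after_path, threshold=0.05):
    a = np.asarray(Image.open(before_path).convert("L"), dtype=np.float32)
    b = np.asarray(Image.open(after_path).convert("L"), dtype=np.float32)
    h, w = a.shape
    changed = []
    for t in range(1, 25):
        left, top, right, bottom = tile_box(t, w, h)
        diff = np.abs(a[top:bottom, left:right] - b[top:bottom, left:right]).mean() / 255.0
        if diff > threshold:
            changed.append(t)
    return changed

print(changed_tiles("before.png", "after.png"))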

3. OCR Text Locator

Get on-screen text coordinates via OCR and identify the corresponding tiles.

  • Detect text via OCR
  • Get screen coordinates
  • Auto-identify "which tile has the Submit button"
# Find which tile contains "Submit" button
python ocr_text_locator.py --text "Submit"
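
The tile lookup is the inverse of the grid math: OCR gives a word's bounding box, and the box center maps back to a tile number. A sketch with pytesseract (assumed approach; the script's actual logic isn't shown):

# Yield (tile, x, y) for each OCR word containing the query string
import pytesseract
from PIL import Image

def locate_text(image_path, query):
    img = Image.open(image_path)
    w, h = img.size
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word and query.lower() in word.lower():
            x = data["left"][i] + data["width"][i] // 2   # word center
            y = data["top"][i] + data["height"][i] // 2
            tile = (y * 4 // h) * 6 + (x * 6 // w) + 1    # inverse of tile_box
            yield tile, x, y

for tile, x, y in locate_text("screen.png", "Submit"):
    print(f'"Submit" is in tile {tile} at ({x}, {y})')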

Design Philosophy: Like Human Vision

Why 24 Tiles?

  • Human visual cognition: only a limited area can be perceived in detail at once
  • Eye-tracking research: only 10-20% of a screen typically receives attention
  • F-pattern: gaze moves top-left → right → down the left edge

VLMs behave the same way: narrowing the focus area improves accuracy over showing the whole screen at once.

CLI/Automation-First Design

  • Number-based: "Show tiles 8,9" in one command
  • Diff detection: "Show only what changed" automatically
  • Pipeline-ready: Combines with other CLI tools

Installation

Requirements

  • Python 3.8+
  • macOS / Linux / Windows

Dependencies

pip install pillow numpy opencv-python pytesseract

Tesseract OCR (for OCR features)

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr