Concept
VLMs can't reliably recognize fine details when shown a full-screen image.
Like human vision, this tool uses a two-stage "overview first → focus on details" approach to improve accuracy.
Problem: Full Images Miss Details
When you show a VLM (GPT-4V, Claude, etc.) a single 1920×1080 full-screen screenshot:
- Small button text is unreadable
- Fine UI elements go unrecognized
- They understand "where things are" but not "what they say"
Humans work the same way: you can't make out details by gazing at the whole screen; you have to focus on specific areas.
Solution: Overview + Grid Focus
Two-Stage Recognition Approach
- Overview: Full screen image to understand "where things are"
- Grid Selection: Select tile numbers for areas needing detail
- Focused Detail: Capture the 2-3 selected tiles for detailed recognition (see the sketch after this list)
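A minimal sketch of that loop; ask_vlm(), parse_tile_numbers(), capture_overview(), and capture_tiles() are hypothetical placeholders for your VLM client and the capture step, not functions this tool ships:

# Hypothetical pseudocode: none of these helpers ship with the tool
overview = capture_overview()                      # stage 1: full screen
reply = ask_vlm([overview], "Which tiles (1-24) contain the settings panel?")
tiles = parse_tile_numbers(reply)                  # e.g. [8, 9]
crops = capture_tiles(tiles)                       # stage 2: detail crops
labels = ask_vlm(crops, "Read the exact text on each button in these crops.")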
24-Tile Grid
The screen is divided into a 6×4 grid of 24 tiles; specify regions by tile number.
┌──────┬──────┬──────┬──────┬──────┬──────┐
│  1   │  2   │  3   │  4   │  5   │  6   │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  7   │  8   │  9   │  10  │  11  │  12  │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  13  │  14  │  15  │  16  │  17  │  18  │
├──────┼──────┼──────┼──────┼──────┼──────┤
│  19  │  20  │  21  │  22  │  23  │  24  │
└──────┴──────┴──────┴──────┴──────┴──────┘
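Tile numbering is row-major from the top-left, so a tile's pixel rectangle follows directly from its number. A minimal sketch (not the tool's actual implementation), assuming a 1920×1080 screen:

# Map a 1-based tile number to its (left, top, right, bottom) pixel box
def tile_rect(n, width=1920, height=1080, cols=6, rows=4):
    col = (n - 1) % cols           # 0-based column index
    row = (n - 1) // cols          # 0-based row index
    w, h = width // cols, height // rows
    return (col * w, row * h, (col + 1) * w, (row + 1) * h)

tile_rect(8)  # -> (320, 270, 640, 540): second tile of the second row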
Core Tools
1. Screen Capture 24Grid
Divides the screen into 24 tiles and captures only the specified ones.
- Specify by tile number (top-left=1, bottom-right=24)
- Select individual or multiple tiles
- Optional overview image output
# Overview + tiles 8,9 (top-center) for detail
python screen_capture_24grid.py --overview --tiles 8,9
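In essence, --tiles is one full-screen grab followed by per-tile crops. A sketch using Pillow's ImageGrab (macOS/Windows out of the box; Linux needs a supported backend) and the tile_rect helper from the sketch above; the output filenames are illustrative, not the tool's actual ones:

from PIL import ImageGrab

screen = ImageGrab.grab()              # full-screen capture
for n in (8, 9):                       # tiles requested on the CLI
    crop = screen.crop(tile_rect(n, *screen.size))
    crop.save(f"tile_{n:02d}.png")     # one high-resolution crop per tile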
2. Screen Diff Detector
Detects changes between captures and automatically selects the modified tiles.
- Detect differences at 24-tile granularity
- Extract only changed tiles
- Auto-focus on "what changed"
# Detect changes and focus on modified regions
python screen_diff_detector.py --threshold 0.05
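One plausible reading of --threshold (the tool's exact metric may differ): a tile counts as changed when its mean absolute pixel difference, on values normalized to 0-1, exceeds 0.05. A sketch with NumPy and Pillow, assuming two same-sized screenshots:

import numpy as np
from PIL import Image

def changed_tiles(before_path, after_path, threshold=0.05, cols=6, rows=4):
    a = np.asarray(Image.open(before_path), dtype=np.float32) / 255.0
    b = np.asarray(Image.open(after_path), dtype=np.float32) / 255.0
    h, w = a.shape[0] // rows, a.shape[1] // cols
    changed = []
    for n in range(1, cols * rows + 1):
        r, c = (n - 1) // cols, (n - 1) % cols
        tile_a = a[r*h:(r+1)*h, c*w:(c+1)*w]
        tile_b = b[r*h:(r+1)*h, c*w:(c+1)*w]
        if np.abs(tile_a - tile_b).mean() > threshold:
            changed.append(n)
    return changed  # e.g. [8, 9] if only the top-center region changed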
3. OCR Text Locator
Locates text on screen via OCR and identifies the tile that contains it.
- Detect text via OCR
- Get screen coordinates
- Auto-identify "which tile has the Submit button"
# Find which tile contains "Submit" button
python ocr_text_locator.py --text "Submit"
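A sketch of the tile lookup built on pytesseract's word-level bounding boxes (image_to_data returns pixel coordinates per word); the screenshot filename is illustrative:

import pytesseract
from PIL import Image

img = Image.open("screen.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data["text"]):
    if word.strip() == "Submit":
        cx = data["left"][i] + data["width"][i] // 2   # box center x
        cy = data["top"][i] + data["height"][i] // 2   # box center y
        col = cx * 6 // img.width    # 0-based column on the 6x4 grid
        row = cy * 4 // img.height   # 0-based row
        print("tile", row * 6 + col + 1)  # 1-based tile number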
Design Philosophy: Like Human Vision
Why 24 Tiles?
- Human visual cognition: Only a limited area is perceivable at once
- Eye-tracking research: Only 10-20% of a page receives attention
- F-pattern: Gaze sweeps top-left → right, then down the left side
VLMs work the same. Narrowing the focus area improves accuracy over showing everything.
CLI/Automation-First Design
- Number-based: "Show tiles 8,9" in one command
- Diff detection: "Show only what changed" automatically
- Pipeline-ready: Combines with other CLI tools
Installation
Requirements
- Python 3.8+
- macOS / Linux / Windows
Dependencies
pip install pillow numpy opencv-python pytesseract
Tesseract OCR (for OCR features)
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr