Vol. 01 · No. 01 · The Vision edition · On-device · always

Vision deep-dive

Fourteen detectors.
On every photo.
On your Mac.

flexGrid uses Apple's Vision framework the way it was meant to be used. A single image request handler hands one CGImage to fourteen concurrent detectors; in parallel, a single CGContext render of the same image feeds four pixel-domain analyses. Your library is analyzed by your Mac's on-device intelligence: quietly, slot-throttled, and entirely offline.

Apple's framework · On-device · Strict actor isolation · Slot-throttled concurrency · ScanReadinessGate

Live: Vision actor

What's running, right now

Every photo, when scanned, is handed to one image request handler that fans out to fourteen concurrent detectors. While those run, a single CGContext render of the same image feeds four pixel-domain analyses on a shared RGBA buffer.

subject · scene · saliency · text · embedding · quality

Walltime

≈ max(detector duration)

Not the sum. The actor's slot counter throttles concurrent images, never detectors-per-image.

Pixel passes

≈ 1 GPU→CPU copy

The naïve version is four. The considered version is one.

01. The fourteen detectors

Every request type, every file reference, every macOS gate.

flexGrid uses the new Swift Vision struct API — no VN prefix on the request types. They're dispatched concurrently with async let against a single image handler.
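The fan-out pattern can be sketched without touching Vision at all. In the minimal example below, `detector` is a hypothetical stand-in that only sleeps, and the names and durations are assumptions for illustration. It shows the shape of the `async let` dispatch, not flexGrid's actual code.

```swift
/// Hypothetical stand-in for one detector: sleeps, then returns its name.
func detector(_ name: String, milliseconds: UInt64) async -> String {
    try? await Task.sleep(nanoseconds: milliseconds * 1_000_000)
    return name
}

/// Starts every child task first, then awaits them together.
/// All three run concurrently, so the call takes roughly as long as the
/// slowest detector (about 90 ms here), not the 180 ms sum.
func analyzeOneImage() async -> [String] {
    async let faces = detector("faces", milliseconds: 30)
    async let scene = detector("scene", milliseconds: 60)
    async let text = detector("text", milliseconds: 90)
    return await [faces, scene, text]
}
```

Each `async let` starts its work immediately; nothing waits until the single `await` at the end, which is why the slowest child sets the walltime.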

01 Subject detection

5 requests

Face rectangles macOS 15+

DetectFaceRectanglesRequest

"Has faces" filter, face count badge, Smart cell display 'Fill to Face' framing.

VisionAnalyzer.swift:607

Human bodies macOS 15+

DetectHumanRectanglesRequest

"Has body" filter, Smart cell display 'Fill to Person' framing.

VisionAnalyzer.swift:630 (upperBodyOnly = false)

Body pose macOS 15+

DetectHumanBodyPoseRequest

Torso rectangles for Smart cell display; nineteen-joint pose fingerprints for the Pose Library.

VisionAnalyzer.swift:1216

Hand pose macOS 15+

DetectHumanHandPoseRequest

"Has hand pose" tag — useful for sign language, gesture content, dance reference.

VisionAnalyzer.swift:794

Animals macOS 15+

RecognizeAnimalsRequest

"Has animals" filter.

VisionAnalyzer.swift:682

02 Scene understanding

1 request

Scene classification macOS 15+

ClassifyImageRequest

Indoor / outdoor / nature / urban / portrait / food / sports — the labels surface in Smart Collections and Spotlight keywords.

VisionAnalyzer.swift:693 · hasMinimumPrecision(0.1, forRecall: 0.8)

03 Saliency

2 requests

Attention saliency macOS 15+

GenerateAttentionBasedSaliencyImageRequest

Where a viewer looks first. The fallback for Smart cell display when there is no face or body.

VisionAnalyzer.swift:733

Objectness saliency macOS 15+

GenerateObjectnessBasedSaliencyImageRequest

Where the salient objects are. Union of all bounding boxes — used for product / animal framing.

VisionAnalyzer.swift:753

04 Quality signals

3 requests

Aesthetics score macOS 15+

CalculateImageAestheticsScoresRequest

A -1…1 quality signal plus a "utility image" flag for screenshots and receipts. Smart Shuffle weights better-looking frames; Spotlight uses the score as a star rating.

VisionAnalyzer.swift:717

Lens smudge / blur macOS 26+

DetectLensSmudgeRequest

"Blurry / smudged" filter at confidence ≥ 0.7. The filter that catches the dog-on-the-camera-lens shots.

VisionAnalyzer.swift:837

Horizon tilt macOS 15+

DetectHorizonRequest

"Tilted Horizon" filter at > 5° off level. Diagnostic for landscape rolls.

VisionAnalyzer.swift:851

05 Text & barcodes

2 requests

Text (OCR) macOS 15+

RecognizeTextRequest

First five hundred characters per item, searchable. Powers Spotlight 'textContent' when opted in.

VisionAnalyzer.swift:653 · recognitionLevel = .fast

Barcodes / QR macOS 15+

DetectBarcodesRequest

Caches up to ten payloads per item, 500 characters each.

VisionAnalyzer.swift:774

06 Embeddings

1 request

Feature print macOS 15+

GenerateImageFeaturePrintRequest

A compact embedding per image — powers "Find similar" using Apple's built-in L2 distance.

VisionAnalyzer.swift:809,1189 · stored as JSON via Codable
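For readers unfamiliar with the metric, here is what an L2 (Euclidean) distance between two embeddings means. This is only an illustration on plain float arrays: the app relies on Vision's own distance computation, and `l2Distance` and `rankBySimilarity` are hypothetical names.

```swift
/// Euclidean distance between two equal-length embedding vectors.
func l2Distance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have the same length")
    var sum: Float = 0
    for (x, y) in zip(a, b) {
        sum += (x - y) * (x - y)
    }
    return sum.squareRoot()
}

/// Orders candidate indices from most to least similar to the query.
func rankBySimilarity(query: [Float], candidates: [[Float]]) -> [Int] {
    candidates.indices.sorted {
        l2Distance(query, candidates[$0]) < l2Distance(query, candidates[$1])
    }
}
```

A smaller distance means a closer match, so "Find similar" amounts to sorting by this value.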

02. One render, four answers

A separate set of pixel-domain analyses, sharing a buffer.

Vision doesn't cover everything. Letterbox bars, brightness classification, monochrome detection and composite/collage detection are pure pixel-domain reads — and they don't need their own GPU-to-CPU copy each. flexGrid renders the image into one heap-allocated RGBA buffer and feeds all four analyses from it.
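A minimal sketch of the single render, assuming an 8-bit premultiplied RGBA layout in device RGB. `renderRGBA` is a hypothetical helper name, not flexGrid's code; the point is that the image is drawn once and the resulting array can be passed to any number of analyses.

```swift
import CoreGraphics

/// Draws `image` once into a tightly packed 8-bit RGBA buffer.
/// Returns nil if a bitmap context cannot be created.
func renderRGBA(_ image: CGImage) -> (pixels: [UInt8], width: Int, height: Int)? {
    let width = image.width
    let height = image.height
    var pixels = [UInt8](repeating: 0, count: width * height * 4)
    let drawn = pixels.withUnsafeMutableBytes { raw -> Bool in
        guard let context = CGContext(
            data: raw.baseAddress,
            width: width,
            height: height,
            bitsPerComponent: 8,
            bytesPerRow: width * 4,
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
        ) else { return false }
        // The one and only render: every later analysis reads `pixels`.
        context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }
    return drawn ? (pixels, width, height) : nil
}
```

Each pixel-domain analysis then takes the same `[UInt8]` as input, so no further copies are needed.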

Pass 1 CGContext · shared

Composite detection

Scans the middle 80 % of rows and columns for low-variance bands (variance < 400), clusters them within a 2 % tolerance, then requires the regions on both sides of each seam to be photographic (variance > 800), which suppresses text and memes. A regionCount ≥ 2 marks the image as a collage.

VisionAnalyzer.swift:1019-1183
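The seam search can be sketched in one dimension. The example below is a simplified illustration, not the shipped algorithm: it looks at rows only, takes a row-major luma grid rather than the RGBA buffer, and `stackedRegionCount` is a hypothetical name. The thresholds are the ones quoted above.

```swift
/// Population variance of a run of luma values.
func variance(_ values: ArraySlice<Double>) -> Double {
    guard !values.isEmpty else { return 0 }
    let mean = values.reduce(0, +) / Double(values.count)
    return values.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / Double(values.count)
}

/// Counts vertically stacked regions separated by flat horizontal seams.
func stackedRegionCount(luma: [Double], width: Int, height: Int) -> Int {
    precondition(luma.count == width * height && width > 0 && height > 0)
    let margin = height / 10             // skip 10 % at each edge: middle 80 %
    let tolerance = max(1, height / 50)  // 2 % of the height

    // 1. Rows in the middle 80 % whose variance is below 400.
    var flatRows: [Int] = []
    for y in margin..<(height - margin)
    where variance(luma[(y * width)..<((y + 1) * width)]) < 400 {
        flatRows.append(y)
    }

    // 2. Cluster flat rows that lie within the tolerance into bands.
    var bands: [ClosedRange<Int>] = []
    for y in flatRows {
        if let last = bands.last, y - last.upperBound <= tolerance {
            bands[bands.count - 1] = last.lowerBound...y
        } else {
            bands.append(y...y)
        }
    }

    // 3. A band counts as a seam only if both neighbours look photographic.
    var seams = 0
    var previousEnd = 0
    for (index, band) in bands.enumerated() {
        let nextStart = index + 1 < bands.count ? bands[index + 1].lowerBound : height
        let above = luma[(previousEnd * width)..<(band.lowerBound * width)]
        let below = luma[((band.upperBound + 1) * width)..<(nextStart * width)]
        if variance(above) > 800 && variance(below) > 800 { seams += 1 }
        previousEnd = band.upperBound + 1
    }
    return seams + 1
}
```

A full version would repeat the same scan over columns and combine the two results.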

Pass 2 CGContext · shared

Brightness classification

ITU-R BT.601 luminance: 0.299·R + 0.587·G + 0.114·B. Below 64 is dark; above 192 is bright. Drives the brightness filter and the Smart Collection 'Quiet Reflections'.

VisionAnalyzer.swift:907-933
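As a rough sketch of this pass, assuming the thresholds apply to the mean luminance of an 8-bit RGBA buffer (the averaging step, the `Brightness` type and the `.normal` label are assumptions, not confirmed by the source):

```swift
enum Brightness { case dark, normal, bright }

/// Classifies an 8-bit RGBA buffer by its mean BT.601 luminance.
func classifyBrightness(rgba: [UInt8]) -> Brightness {
    let pixelCount = rgba.count / 4
    guard pixelCount > 0 else { return .normal }
    var sum = 0.0
    for i in stride(from: 0, to: pixelCount * 4, by: 4) {
        // BT.601 weights; the alpha byte at i + 3 is ignored.
        sum += 0.299 * Double(rgba[i])
             + 0.587 * Double(rgba[i + 1])
             + 0.114 * Double(rgba[i + 2])
    }
    let mean = sum / Double(pixelCount)
    if mean < 64 { return .dark }
    if mean > 192 { return .bright }
    return .normal
}
```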

Pass 3 CGContext · shared

Monochrome detection

Mean RGB-max-minus-min saturation < 0.08 across the buffer. Distinguishes a true black-and-white photo from a low-saturation one.

VisionAnalyzer.swift:914-937
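The saturation test can be sketched as follows, assuming channels are normalised to 0...1 before the 0.08 threshold is applied. `isMonochrome` is a hypothetical name, and the normalisation is an assumption.

```swift
/// True when the mean (max - min) channel spread stays below the threshold.
func isMonochrome(rgba: [UInt8], threshold: Double = 0.08) -> Bool {
    let pixelCount = rgba.count / 4
    guard pixelCount > 0 else { return false }
    var total = 0.0
    for i in stride(from: 0, to: pixelCount * 4, by: 4) {
        let r = rgba[i], g = rgba[i + 1], b = rgba[i + 2]
        // Spread is 0 for a pure gray pixel and 1 for a fully saturated one.
        let spread = Int(max(r, g, b)) - Int(min(r, g, b))
        total += Double(spread) / 255.0
    }
    return total / Double(pixelCount) < threshold
}
```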

Pass 4 CGContext · shared

Letterbox / pillarbox

Top, bottom, left, right black-bar scan with a 5 % minimum bar size. The image-side equivalent of CleanApertureDetector for video.

VisionAnalyzer.swift:942-1004
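A bar scan along these lines might look like the sketch below. The 5 % minimum comes from the description above; the `blackLevel` cutoff of 16, the `Bars` type and the function name are assumptions for illustration.

```swift
struct Bars: Equatable {
    var top = 0, bottom = 0, left = 0, right = 0
}

/// Measures solid black bars on each edge of an 8-bit RGBA buffer, in pixels.
func detectBars(rgba: [UInt8], width: Int, height: Int,
                blackLevel: UInt8 = 16, minimumFraction: Double = 0.05) -> Bars {
    precondition(rgba.count == width * height * 4)

    func isBlack(_ x: Int, _ y: Int) -> Bool {
        let i = (y * width + x) * 4
        return rgba[i] <= blackLevel && rgba[i + 1] <= blackLevel && rgba[i + 2] <= blackLevel
    }
    func rowIsBlack(_ y: Int) -> Bool { (0..<width).allSatisfy { isBlack($0, y) } }
    func columnIsBlack(_ x: Int) -> Bool { (0..<height).allSatisfy { isBlack(x, $0) } }

    // Walk inward from each edge while the whole row or column is black.
    var bars = Bars()
    while bars.top < height, rowIsBlack(bars.top) { bars.top += 1 }
    while bars.bottom < height - bars.top, rowIsBlack(height - 1 - bars.bottom) { bars.bottom += 1 }
    while bars.left < width, columnIsBlack(bars.left) { bars.left += 1 }
    while bars.right < width - bars.left, columnIsBlack(width - 1 - bars.right) { bars.right += 1 }

    // Ignore bars thinner than the minimum fraction of the dimension.
    if Double(bars.top) < minimumFraction * Double(height) { bars.top = 0 }
    if Double(bars.bottom) < minimumFraction * Double(height) { bars.bottom = 0 }
    if Double(bars.left) < minimumFraction * Double(width) { bars.left = 0 }
    if Double(bars.right) < minimumFraction * Double(width) { bars.right = 0 }
    return bars
}
```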

The naïve version is four CGImage renders, four GPU-to-CPU copies, four heap allocations. The considered version is one. That's the difference between "Vision feels fast" and "Vision keeps the fan quiet."

“Four separate GPU-to-CPU copies would be the normal way. One copy is the considered way.”

VisionAnalyzer.swift:872-945

03. Subject Cutout

A foreground instance mask, lifted into SwiftUI.

Click the scissors on any cell and the subject floats above the grid borders. Behind the click: a real Vision pipeline that ends in a SwiftUI .mask() modifier. Six steps, fully on-device.

1. CGImage at fast / balanced / detailed resolution (384 / 512 / 1024 px) · VisionAnalyzer.swift:340-353
2. VNImageRequestHandler (legacy API, intentional, for mask support) · VisionAnalyzer.swift:357
3. VNGenerateForegroundInstanceMaskRequest runs subject segmentation · macOS 14+ · Apple Silicon recommended
4. observation.generateScaledMaskForImage(forInstances:from:) · returns a CVPixelBuffer (8-bit gray)
5. CGContext (DeviceGray, alpha .none) → NSBitmapImageRep PNG · VisionAnalyzer.swift:390-407
6. SwiftUI .mask(alignment:) { Image(nsImage: maskImage) } · CutoutRenderer.swift:18-23
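The Vision part of the pipeline (steps 2 to 4) can be sketched with the legacy API as follows. `foregroundMask` is a hypothetical wrapper name, and the sketch stops at the pixel buffer; the PNG conversion and the SwiftUI mask are left out.

```swift
import Vision
import CoreGraphics
import CoreVideo

/// Runs subject segmentation and returns a mask scaled to the input image,
/// or nil when Vision finds no foreground instance.
@available(macOS 14.0, iOS 17.0, *)
func foregroundMask(for image: CGImage) throws -> CVPixelBuffer? {
    // Step 2: the handler that owns the image.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])

    // Step 3: the segmentation request.
    let request = VNGenerateForegroundInstanceMaskRequest()
    try handler.perform([request])
    guard let observation = request.results?.first else { return nil }

    // Step 4: one mask covering every detected instance.
    return try observation.generateScaledMaskForImage(
        forInstances: observation.allInstances,
        from: handler
    )
}
```

The same handler is passed back into the mask call, which is why the request and the mask generation live in one function.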

The mask only generates when there's a reason — face, body, animals, or saliency area > 15 %. Otherwise the request never fires, and the wall stays cool.
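That trigger rule is a simple predicate. A minimal sketch, with `SubjectSignals` and `shouldGenerateMask` as hypothetical names and the saliency area expressed as a fraction of the image:

```swift
/// Detector results that matter for the cutout decision.
struct SubjectSignals {
    var hasFace = false
    var hasBody = false
    var hasAnimals = false
    var saliencyArea = 0.0   // fraction of the image, 0...1
}

/// True when any subject signal is present or the salient area exceeds 15 %.
func shouldGenerateMask(_ signals: SubjectSignals,
                        minimumSaliencyArea: Double = 0.15) -> Bool {
    signals.hasFace
        || signals.hasBody
        || signals.hasAnimals
        || signals.saliencyArea > minimumSaliencyArea
}
```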

04. ScanReadinessGate

Invisible until ready.

The most distinctive architectural rule in flexGrid: a feature isn't visible until the data it leans on is in. Filters, Smart Collections, Editorial Layout's hero picker, Subject Cutout — all gated by coverage thresholds against the live Vision sweep.

The surface effect is that you never see a half-empty filter menu, you never click a "Find similar" button that doesn't have an embedding yet, and you never get a Cutout cell that produces nothing. The features show up when they're real.

Coverage rules · in the app, right now

Analyzing · 1,843 / 4,212

Vision sweep · all 14 detectors

Vision tag filters · ≥ 60% coverage
Smart Collections · ≥ 60% coverage
Editorial Layout heroes · Vision-aware slot picker waits for aesthetic + scene tags
Smart cell display · Per-cell; uses whatever is ready
Subject Cutout · Triggers only when faces/body/animals/large saliency present
Aspect Match · Ready as soon as EXIF / first-frame AR is in
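The coverage rule reduces to a ratio check. A minimal sketch, assuming coverage is analyzed items divided by total items; `ScanProgress` and `isFeatureVisible` are hypothetical names, not the actual ScanReadinessGate API.

```swift
/// Progress of the Vision sweep over the library.
struct ScanProgress {
    let analyzed: Int
    let total: Int

    /// Fraction of the library already analyzed; 0 for an empty library.
    var coverage: Double {
        total == 0 ? 0 : Double(analyzed) / Double(total)
    }
}

/// A gated feature appears only once coverage reaches its threshold.
func isFeatureVisible(_ progress: ScanProgress,
                      minimumCoverage: Double = 0.60) -> Bool {
    progress.coverage >= minimumCoverage
}
```

With the sweep at 1,843 of 4,212 items, coverage is about 44 %, so the 60 % features stay hidden until the sweep catches up.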

“We never show you a control that wouldn't work.”

BEHIND_THE_SCENES.md · ScanReadinessGate

05. Concurrency

An actor, a slot counter, and a TaskGroup.

VisionAnalyzer is a Swift 6 actor. For each image, the fourteen detectors start concurrently with async let and are awaited together in one block, so total walltime is roughly that of the slowest detector, not the sum. The actor's internal slot counter limits how many images are in flight, not how many detectors run per image, which keeps the GPU from being oversubscribed.
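One way to build such a slot counter is an actor that parks callers on continuations once every slot is taken. The sketch below is an assumption about the general technique, not VisionAnalyzer's code; `SlotGate`, `PeakCounter` and `scan` are hypothetical names, and the sleep stands in for per-image work.

```swift
/// Limits how many callers hold a slot at once.
actor SlotGate {
    private let limit: Int
    private var inUse = 0
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(limit: Int) { self.limit = limit }

    func acquire() async {
        if inUse < limit {
            inUse += 1
            return
        }
        // All slots taken: suspend until release() hands one over.
        await withCheckedContinuation { waiters.append($0) }
    }

    func release() {
        if waiters.isEmpty {
            inUse -= 1
        } else {
            // Pass the slot straight to the next waiter; inUse is unchanged.
            waiters.removeFirst().resume()
        }
    }
}

/// Records the highest number of images processed at the same time.
actor PeakCounter {
    private var current = 0
    private(set) var peak = 0
    func enter() { current += 1; peak = max(peak, current) }
    func leave() { current -= 1 }
}

/// Processes `imageCount` fake images in a TaskGroup and returns the peak concurrency.
func scan(imageCount: Int, limit: Int) async -> Int {
    let gate = SlotGate(limit: limit)
    let counter = PeakCounter()
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<imageCount {
            group.addTask {
                await gate.acquire()
                await counter.enter()
                try? await Task.sleep(nanoseconds: 5_000_000)  // stand-in for analysis
                await counter.leave()
                await gate.release()
            }
        }
    }
    return await counter.peak
}
```

The TaskGroup may start every task at once, but only `limit` of them get past `acquire()` at any moment; the rest wait without occupying a thread.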

Per-image walltime

≈ max

max(detector duration) — not the sum.

Pixel passes

1 copy

One GPU→CPU render. Four answers.

Memory pressure

4 tiers

Normal · Elevated · High pauses Vision · Critical triages everything.

Privacy

On-device. Always. By the framework.

Vision is on-device by design, and flexGrid doesn't bypass it. There are no network calls in the Vision path, no third-party endpoints anywhere, and no inference offloaded to a server. Your library is analyzed on your Mac, and the results stay on your Mac.

FoundationModels, when it runs, follows the same contract: the model it exposes runs on-device too. flexGrid does nothing to break that contract.

The bottom line

All this. So the wall just plays.

None of the engine work matters unless you can feel it in the app. Drop a folder, watch the filters appear as the sweep finishes, and the work below stops being engineering and starts being yours.
