DetectFaceRectanglesRequest
"Has faces" filter, face count badge, Smart cell display 'Fill to Face' framing.
VisionAnalyzer.swift:607
Vision deep-dive
flexGrid uses Apple's Vision framework the way it was meant to be used. A single VNImageRequestHandler hands one CGImage to fourteen concurrent detectors; in parallel, a single CGContext render of the same image feeds four pixel-domain analyses. The library walks itself through your Mac's on-device intelligence — quietly, slot-throttled, and entirely offline.
What's running, right now
Every photo, when scanned, is handed to one VNImageRequestHandler that fans out to fourteen concurrent detectors. While those run, a single CGContext render of the same image feeds four pixel-domain analyses on a shared RGBA buffer.
Walltime
≈ max(detector duration)
Not the sum. The actor's slot counter throttles concurrent images, never detectors-per-image.
Pixel passes
≈ 1 GPU→CPU copy
The naïve version is four. The considered version is one.
flexGrid uses the new Swift Vision struct API — no
VN prefix on the
request types. They're dispatched concurrently with
async let against
a single image handler.
"Has faces" filter, face count badge, Smart cell display 'Fill to Face' framing.
VisionAnalyzer.swift:607
"Has body" filter, Smart cell display 'Fill to Person' framing.
VisionAnalyzer.swift:630 (upperBodyOnly = false)
Torso rectangles for Smart cell display; nineteen-joint pose fingerprints for the Pose Library.
VisionAnalyzer.swift:1216
"Has hand pose" tag — useful for sign language, gesture content, dance reference.
VisionAnalyzer.swift:794
"Has animals" filter.
VisionAnalyzer.swift:682
Indoor / outdoor / nature / urban / portrait / food / sports — the labels surface in Smart Collections and Spotlight keywords.
VisionAnalyzer.swift:693 · hasMinimumPrecision(0.1, forRecall: 0.8)
Where a viewer looks first. The fallback for Smart cell display when there is no face or body.
VisionAnalyzer.swift:733
Where the salient objects are. Union of all bounding boxes — used for product / animal framing.
VisionAnalyzer.swift:753
A -1…1 quality signal plus a "utility image" flag for screenshots and receipts. Smart Shuffle weights better-looking frames; Spotlight uses the score as a star rating.
VisionAnalyzer.swift:717
"Blurry / smudged" filter at confidence ≥ 0.7. The filter that catches the dog-on-the-camera-lens shots.
VisionAnalyzer.swift:837
"Tilted Horizon" filter at > 5° off level. Diagnostic for landscape rolls.
VisionAnalyzer.swift:851
First five hundred characters per item, searchable. Powers Spotlight 'textContent' when opted in.
VisionAnalyzer.swift:653 · recognitionLevel = .fast
Cached up to ten payloads per item, 500 chars each.
VisionAnalyzer.swift:774
A compact embedding per image — powers "Find similar" using Apple's built-in L2 distance.
VisionAnalyzer.swift:809,1189 · stored as JSON via Codable
Vision doesn't cover everything. Letterbox bars, brightness classification, monochrome detection and composite/collage detection are pure pixel-domain reads — and they don't need their own GPU-to-CPU copy each. flexGrid renders the image into one heap-allocated RGBA buffer and feeds all four analyses from it.
Scans middle 80 % of rows and columns for low-variance bands (variance < 400), clusters them within 2 % tolerance, then requires both regions either side of each seam to be photographic (variance > 800) to suppress text and memes. regionCount ≥ 2 means it's a collage.
VisionAnalyzer.swift:1019-1183
ITU-R BT.601 luminance, 0.299·R + 0.587·G + 0.114·B. < 64 is dark, > 192 is bright. Drives the brightness filter and Smart Collection 'Quiet Reflections'.
VisionAnalyzer.swift:907-933
Mean RGB-max-minus-min saturation < 0.08 across the buffer. Distinguishes a true black-and-white photo from a low-saturation one.
VisionAnalyzer.swift:914-937
Top, bottom, left, right black-bar scan with a 5 % minimum bar size. The image-side equivalent of CleanApertureDetector for video.
VisionAnalyzer.swift:942-1004
The naïve version is four CGImage renders, four GPU-to-CPU copies, four heap allocations. The considered version is one. That's the difference between "Vision feels fast" and "Vision keeps the fan quiet."
Four separate GPU-to-CPU copies would be the normal way. One copy is the considered way.
VisionAnalyzer.swift:872-945
Click the scissors on any cell and the subject floats above the
grid borders. Behind the click: a real Vision pipeline that
ends in a SwiftUI .mask()
modifier. Six steps, fully on-device.
CGImage at fast / balanced / detailed resolution (384 / 512 / 1024 px)
VisionAnalyzer.swift:340-353
VNImageRequestHandler — legacy API, intentional, for mask support
VisionAnalyzer.swift:357
VNGenerateForegroundInstanceMaskRequest — runs subject segmentation
macOS 14+ · Apple Silicon recommended
observation.generateScaledMaskForImage(forInstances:from:)
Returns CVPixelBuffer (8-bit gray)
CGContext (DeviceGray, alpha .none) → NSBitmapImageRep PNG
VisionAnalyzer.swift:390-407
SwiftUI .mask(alignment:) { Image(nsImage: maskImage) }
CutoutRenderer.swift:18-23
The mask only generates when there's a reason — face, body, animals, or saliency area > 15 %. Otherwise the request never fires, and the wall stays cool.
The most distinctive architectural rule in flexGrid: a feature isn't visible until the data it leans on is in. Filters, Smart Collections, Editorial Layout's hero picker, Subject Cutout — all gated by coverage thresholds against the live Vision sweep.
The surface effect is that you never see a half-empty filter menu, you never click a "Find similar" button that doesn't have an embedding yet, and you never get a Cutout cell that produces nothing. The features show up when they're real.
Coverage rules · in the app, right now
We never show you a control that wouldn't work.
BEHIND_THE_SCENES.md · ScanReadinessGate
VisionAnalyzer is a real Swift 6 actor. Per-image, the fourteen
detectors run concurrently with
async let and
await in one block — total walltime is roughly the slowest
detector, not the sum. The actor's internal slot counter
throttles concurrent images, not detectors per image,
so the GPU never gets oversubscribed.
Per-image walltime
≈ max
max(detector duration) — not the sum.
Pixel passes
1 copy
One GPU→CPU render. Four answers.
Memory pressure
4 tiers
Normal · Elevated · High pauses Vision · Critical triages everything.
Privacy
Vision is on-device by design. flexGrid doesn't bypass it. There are no network calls in the Vision path, no third-party endpoints anywhere, no inference offloaded to a server. Your library is analyzed on your Mac, and the results live on your Mac, and that's where they stay.
FoundationModels, when it runs, follows the same contract: Apple Intelligence is on-device too. We just don't break the contract for you.
The bottom line
None of the engine work matters unless you can feel it in the app. Drop a folder, watch the filters appear as the sweep finishes, and the work below stops being engineering and starts being yours.