Great question — this is one of the most interesting parts of building a conversational AI on social platforms. Here's a comprehensive breakdown of how shared media works across all three platforms and how to analyse it with LLMs and other tools.
## When a User Shares Media in a DM

### What Meta Actually Sends You
When someone shares a reel, post, image, video, or story, the webhook payload differs slightly per platform, but the core structure is the same — you get a URL or attachment object in the message body.
## Platform-by-Platform Breakdown

### 1. Instagram DM
When a user shares a reel or post inside a DM, the webhook delivers an attachments array:
```json
{
  "message": {
    "mid": "...",
    "attachments": [
      {
        "type": "ig_reel",          // or "image", "video", "share"
        "payload": {
          "url": "https://cdn.instagram.com/...",
          "title": "Reel title (if any)"
        }
      }
    ]
  }
}
```
Shared posts come as `"type": "share"` with a `payload.url` pointing to the public post URL. Reels come as `"type": "ig_reel"` or `"video"` depending on context.
**Key constraints:**
- The CDN URL in `payload.url` is ephemeral — it expires in minutes. Download immediately upon webhook receipt.
- For private/restricted accounts, you may only get a URL to the post page, not the raw media.
- You need the `instagram_business_manage_messages` permission to read attachment payloads.
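In practice this means the webhook handler should pull attachments out of the event and fetch them straight away. A minimal sketch, assuming the standard `entry` → `messaging` envelope Meta wraps message events in (helper names are illustrative):

```python
from urllib.request import urlopen

def extract_attachments(webhook_body: dict) -> list[tuple[str, str]]:
    """Walk the entry -> messaging -> message.attachments envelope,
    returning (type, url) pairs for every attachment that has a URL."""
    results = []
    for entry in webhook_body.get("entry", []):
        for event in entry.get("messaging", []):
            for att in event.get("message", {}).get("attachments", []):
                url = att.get("payload", {}).get("url")
                if url:
                    results.append((att["type"], url))
    return results

def download_immediately(url: str) -> bytes:
    # CDN URLs expire within minutes, so fetch as soon as the webhook lands
    with urlopen(url) as resp:
        return resp.read()
```

The extraction is deliberately defensive (`.get` everywhere) because not every messaging event carries attachments.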
### 2. Facebook Messenger
Very similar structure. Facebook is more permissive — you get direct CDN URLs for images and videos:
```json
{
  "message": {
    "attachments": [
      {
        "type": "video",            // "image", "video", "audio", "file"
        "payload": {
          "url": "https://video.xx.fbcdn.net/..."
        }
      }
    ]
  }
}
```
Shared Facebook posts/reels come via the `share` type with a `payload.url` (the public post link) and optionally `payload.sticker_id`.
### 3. WhatsApp Business Cloud API
WhatsApp handles media differently — it uses media IDs instead of direct URLs:
```json
{
  "messages": [
    {
      "type": "image",              // "image", "video", "audio", "document", "sticker"
      "image": {
        "id": "3857693717630",
        "mime_type": "image/jpeg",
        "sha256": "...",
        "caption": "Check this out"
      }
    }
  ]
}
```
You then do a two-step retrieval:

1. `GET https://graph.facebook.com/v19.0/{media-id}` → returns a temporary download URL
2. `GET {download_url}` with `Authorization: Bearer {token}` → download the actual file
WhatsApp does not send direct public post/reel links with parsed metadata — it just sends whatever the user pasted as text, or an image/video they forwarded.
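The two-step retrieval can be sketched with the standard library; helper names are illustrative, and the Graph API version is pinned to v19.0 as used elsewhere in this doc:

```python
import json
from urllib.request import Request, urlopen

GRAPH = "https://graph.facebook.com/v19.0"

def media_lookup_url(media_id: str) -> str:
    # Step 1 endpoint: resolves a media ID to a temporary download URL
    return f"{GRAPH}/{media_id}"

def download_whatsapp_media(media_id: str, token: str) -> bytes:
    # Step 1: ask the Graph API for the temporary URL
    req = Request(media_lookup_url(media_id),
                  headers={"Authorization": f"Bearer {token}"})
    with urlopen(req) as resp:
        media_url = json.loads(resp.read())["url"]
    # Step 2: fetch the actual bytes, again with the bearer token
    req = Request(media_url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(req) as resp:
        return resp.read()
```

Both requests need the bearer token; the temporary URL from step 1 is not publicly fetchable without it.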
## The Processing Pipeline
Here's the full architecture for handling shared media:
```
Webhook received
      │
      ▼
[1] Extract attachment type + URL/media-ID
      │
      ▼
[2] Download media immediately (CDN URLs expire fast)
      │
      ├─ Image    → Vision LLM (Claude, GPT-4V, Gemini Vision)
      ├─ Video    → Frame extraction → Vision LLM per frame
      ├─ Audio    → Whisper / Deepgram → transcript → LLM
      ├─ Post URL → Scrape OG tags / oEmbed → LLM
      └─ Reel URL → yt-dlp download → frames + audio
      │
      ▼
[3] Build context + user message → Claude
      │
      ▼
[4] Send reply via Send API
```
## Step-by-Step: Analysing Each Media Type

### Images
Send directly to a vision-capable LLM:
```python
import anthropic, base64, httpx

def analyse_image(image_url: str, user_caption: str) -> str:
    # Fetch the image from the (ephemeral) CDN URL and base64-encode it
    image_data = base64.b64encode(httpx.get(image_url).content).decode()

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": f"The user shared this image and said: '{user_caption}'. Respond helpfully as Orris.",
                },
            ],
        }],
    )
    return response.content[0].text
```
### Videos / Reels
Videos need to be broken into frames first (Claude's API doesn't accept raw video input):
```python
import cv2, os, tempfile, httpx

def extract_frames(video_url: str, num_frames: int = 8) -> list[bytes]:
    # Download the video to a temp file (OpenCV needs a file path)
    video_bytes = httpx.get(video_url).content
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(video_bytes)
        tmp_path = f.name

    # Extract evenly-spaced frames as JPEG bytes
    cap = cv2.VideoCapture(tmp_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ret, frame = cap.read()
        if ret:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(buf.tobytes())
    cap.release()
    os.unlink(tmp_path)  # clean up the temp file
    return frames
```
Then send all frames in a single Claude message with multiple image blocks — Claude can reason across them coherently.
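Assembling that multi-frame message is just a list of image blocks followed by a text block. A sketch of building the content array (`build_frame_message` is an illustrative helper, not an SDK call); pass the result as the user message's `content`:

```python
import base64

def build_frame_message(frames: list[bytes], transcript: str, user_text: str) -> list[dict]:
    """Build Claude content blocks: one image block per frame, then a text block
    carrying the audio transcript and the user's own message."""
    blocks = [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(frame).decode(),
            },
        }
        for frame in frames
    ]
    blocks.append({
        "type": "text",
        "text": (
            f"These frames were sampled evenly from a video the user shared. "
            f"Audio transcript: '{transcript}'. The user said: '{user_text}'."
        ),
    })
    return blocks
```

Usage would be `client.messages.create(..., messages=[{"role": "user", "content": build_frame_message(frames, transcript, user_text)}])`.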
For the audio track of a reel, extract with ffmpeg and transcribe:
```bash
ffmpeg -i reel.mp4 -vn -acodec libmp3lame audio.mp3
```
Then send to OpenAI Whisper or Deepgram for transcription, and include the transcript in Claude's context.
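The ffmpeg step can be wrapped from Python with `subprocess`; a sketch assuming `ffmpeg` is on the PATH (helper names are illustrative):

```python
import subprocess

def ffmpeg_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    # -vn drops the video stream; libmp3lame encodes the audio track as mp3
    return ["ffmpeg", "-y", "-i", video_path, "-vn",
            "-acodec", "libmp3lame", audio_path]

def extract_audio(video_path: str, audio_path: str) -> None:
    # check=True raises CalledProcessError if ffmpeg fails (e.g. no audio track)
    subprocess.run(ffmpeg_audio_cmd(video_path, audio_path),
                   check=True, capture_output=True)
```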
### Shared Post URLs (Instagram/Facebook)
When you get a share URL like https://www.instagram.com/reel/ABC123/, you have two options:
**Option A — oEmbed API (Meta official):**

```
GET https://graph.facebook.com/v19.0/instagram_oembed
    ?url=https://www.instagram.com/reel/ABC123/
    &access_token={token}
```

Returns: title, author, thumbnail URL, HTML embed. Works for public posts, though your app needs Meta's oEmbed Read feature approved.
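A minimal sketch of calling that endpoint with the standard library (function names are illustrative; the response includes fields like `author_name`, `thumbnail_url`, and `html`):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OEMBED_ENDPOINT = "https://graph.facebook.com/v19.0/instagram_oembed"

def oembed_request_url(post_url: str, access_token: str) -> str:
    # urlencode handles escaping the post URL into a query parameter
    return f"{OEMBED_ENDPOINT}?{urlencode({'url': post_url, 'access_token': access_token})}"

def fetch_oembed(post_url: str, access_token: str) -> dict:
    with urlopen(oembed_request_url(post_url, access_token)) as resp:
        return json.loads(resp.read())
```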
**Option B — Open Graph / scraping:**

Fetch the URL and parse the `<meta property="og:...">` tags — you get the title, description, thumbnail, and video URL without an API call. Tools like BeautifulSoup or opengraph-py3 make this trivial.
```python
import requests
from bs4 import BeautifulSoup

def scrape_og(url: str) -> dict:
    # Identify as Facebook's link crawler so the page serves full OG tags
    html = requests.get(url, headers={"User-Agent": "facebookexternalhit/1.1"}).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        tag["property"]: tag["content"]
        for tag in soup.find_all("meta", property=True)
        if tag.get("content")
    }
```
### Audio Messages (WhatsApp)
WhatsApp voice notes and audio messages come as "type": "audio" with a media ID. After downloading:
```python
import openai

def transcribe_audio(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        transcript = openai.audio.transcriptions.create(
            model="whisper-1",
            file=f,
        )
    return transcript.text
```
## Putting It Together: A Unified Handler
```python
async def handle_attachment(attachment: dict, user_text: str) -> str:
    atype = attachment["type"]
    # WhatsApp attachments carry a media ID instead of a payload.url,
    # so fall back to an empty string rather than raising KeyError
    url = attachment.get("payload", {}).get("url", "")

    if atype == "image":
        return analyse_image(url, user_text)

    elif atype in ("video", "ig_reel"):
        frames = extract_frames(url)
        transcript = extract_and_transcribe_audio(url)  # optional
        return analyse_video_frames(frames, transcript, user_text)

    elif atype == "share":
        og = scrape_og(url)
        thumbnail = og.get("og:image")
        description = og.get("og:description", "")
        # Optionally fetch and analyse the thumbnail too
        return ask_claude_about_post(og, thumbnail, user_text)

    elif atype == "audio":
        local_path = download_whatsapp_media(attachment["id"])
        transcript = transcribe_audio(local_path)
        return ask_claude(f"The user sent a voice note: '{transcript}'. {user_text}")

    else:
        return "I received your attachment but can't process this type yet."
```
## Key Constraints to Keep in Mind

| Platform | Media Type | What You Get | Expiry |
|---|---|---|---|
| Instagram DM | Image/Video | Direct CDN URL | ~1 hour |
| Instagram DM | Shared Reel/Post | Post URL or CDN URL | ~1 hour |
| Facebook Messenger | Image/Video | Direct CDN URL | ~1 hour |
| Facebook Messenger | Shared Post | Post URL + snippet | ~1 hour |
| WhatsApp | Image/Video/Audio | Media ID → temp URL | 30 days (but download ASAP) |
**Important:**
- Download on webhook receipt — don't store URLs, store the files or base64.
- For private Instagram posts shared in DMs, you'll only get a link back to the post — you can't pull the raw video unless the account owner granted your app permissions.
- Meta's terms prohibit storing user media beyond what's needed to fulfil the request. Build your pipeline to be stateless — process and discard, don't cache media.
- For reels with music, you'll get the audio but the music may be copyright-restricted. Transcription is fine; redistribution is not.
## Recommended Tool Stack

| Purpose | Tool |
|---|---|
| Vision analysis | Claude (claude-opus-4-5 / claude-sonnet-4-5) |
| Audio transcription | OpenAI Whisper or Deepgram Nova-3 |
| Video frame extraction | OpenCV (cv2) or ffmpeg |
| OG/post scraping | BeautifulSoup + requests |
| Reel downloading (dev/test) | yt-dlp |
| Media storage | S3 or GCS (pre-signed URLs, short TTL) |
This pipeline gives Orris full awareness of whatever a user throws at it — images, reels, voice notes, shared posts — and lets Claude reason over all of it contextually before replying.