Email Reply Extraction with Talon
Why Talon?
Email threads accumulate quoted replies that clutter the actual content. When processing emails programmatically, you need just the new message, not the entire conversation history.
Talon solves this problem by extracting clean reply content through sophisticated pattern matching and structural analysis.
Use Cases
- AI Email Agents: Extract new user messages without processing entire thread history
- Email Automation: Parse replies to identify actionable content
- Thread Analysis: Build conversation flows by isolating individual contributions
- Inbox Management: Process only new information from replies
Why Choose Talon?
- Handles Gmail, Outlook, Apple Mail, and Thunderbird HTML structures
- 93.8% success rate across 64 real-world test cases
- Supports English, Japanese, Swedish, Polish, Dutch, and German
- 1.92ms average processing time, 488 emails/second
How Talon Works
Talon uses two complementary approaches depending on email format:
Plain Text Processing (6-Stage Pipeline)
- Line Classification: Assigns markers to each line (‘t’=text, ‘m’=quote marker, ‘s’=splitter, ‘e’=empty)
- Pattern Matching: Applies regex to marker sequences to identify quoted blocks
- Content Extraction: Removes quoted lines and returns clean text
Recognizes patterns like:
- Standard quote markers (>)
- Reply headers (“On [date] [name] wrote:”)
- Forward indicators (“-----Original Message-----”)
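The classify-then-match idea can be illustrated with a minimal sketch. This is not Talon's implementation: the two regular expressions, the marker-sequence pattern, and the function names `classify` and `strip_quotes` are simplified assumptions chosen only to show how per-line markers let a single regex locate the quoted tail.

```python
import re

# Illustrative patterns only; Talon's real rules are far more extensive.
QUOTE_LINE = re.compile(r"^\s*>")
SPLITTER_LINE = re.compile(
    r"^\s*(On .+ wrote:|-{3,}\s*Original Message\s*-{3,})\s*$", re.IGNORECASE
)


def classify(line: str) -> str:
    """Return a one-character marker: e=empty, s=splitter, m=quote, t=text."""
    if not line.strip():
        return "e"
    if SPLITTER_LINE.match(line):
        return "s"
    if QUOTE_LINE.match(line):
        return "m"
    return "t"


def strip_quotes(text: str) -> str:
    """Drop everything from the first splitter, or a trailing quoted block."""
    lines = text.splitlines()
    markers = "".join(classify(line) for line in lines)
    # Either a splitter and all that follows, or a trailing run of quote/empty lines.
    match = re.search(r"s.*$|e*m[me]*$", markers)
    kept = lines[: match.start()] if match else lines
    return "\n".join(kept).rstrip()
```

Because the regex runs over the marker string rather than the raw text, one short pattern covers many concrete reply layouts. The trade-off is that any line misclassified in the first step propagates into the match.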
HTML Processing (8-Stage Pipeline)
- Structural Removal: Directly removes known quotation elements (Gmail divs, blockquotes, Outlook markup)
- Checkpoint Fallback: For non-standard HTML, maps elements to text lines, applies text patterns, removes corresponding HTML
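Structural removal can be sketched with the standard library alone. The example below is an assumption-laden illustration, not Talon's code: it treats `blockquote` elements, the `gmail_quote` class, and the `OLK_SRC_BODY_SECTION` id as quotation containers, and it returns the remaining visible text rather than cleaned HTML. The class name `QuoteStripper` and the function `strip_html_quotes` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed quotation markers; real clients use more variants than these.
QUOTE_CLASSES = {"gmail_quote"}
QUOTE_IDS = {"OLK_SRC_BODY_SECTION"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class QuoteStripper(HTMLParser):
    """Collects text that sits outside known quotation elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than zero while inside a quoted element
        self.parts = []

    def _is_quote(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        return (
            tag == "blockquote"
            or bool(classes & QUOTE_CLASSES)
            or attrs.get("id") in QUOTE_IDS
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        if self.skip_depth or self._is_quote(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html_quotes(html: str) -> str:
    """Return visible text outside quotation elements, whitespace-normalized."""
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

The depth counter assumes well-formed markup inside the quoted element; unclosed tags would throw it off, which is one reason a fallback path for non-standard HTML is useful.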
Processing Systems
Quotation Removal (Primary)
- Removes quoted replies from thread
- No initialization required
- Rule-based pattern matching
Getting Started
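A minimal usage sketch, assuming the Python package is installed and importable as `talon` and that quotation removal is exposed through `talon.quotations.extract_from`. The wrapper name `extract_reply` is hypothetical, and the import is deferred so the module still loads where Talon is not installed.

```python
def extract_reply(body: str, content_type: str = "text/plain") -> str:
    """Return the newest message from an email body using Talon."""
    # Imported lazily so this module loads even where Talon is not installed.
    from talon import quotations

    # content_type is "text/plain" or "text/html"
    return quotations.extract_from(body, content_type)
```

Pass `"text/html"` as the second argument for HTML bodies so that the HTML pipeline is used instead of the plain text one.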
Performance & Accuracy
Talon has been tested on 64 real-world emails from various clients and languages.
Test Results Summary
Test Coverage
- 22 HTML emails: Gmail, Outlook, Apple Mail, Thunderbird, Mail.ru, Hotmail
- 42 plain text emails: Various formats and reply styles
- 6+ languages: English, Japanese, Swedish, Polish, Dutch, German
- Mobile clients: iPhone, Android “Sent from” signatures
Processing Time by Complexity
Speed vs Accuracy Tradeoff
Insight: For production systems, 1.92ms average is negligible. Even at worst case (21.55ms), Talon is faster than most network requests.
Known Limitations
Talon failed 4 out of 64 test cases. Here’s what didn’t work:
Failed Test Cases (4 total)
Test Case 1: Complex Email Thread with Mixed Content
Input:
Expected Output: First 5 lines only (up to Christopher Edwards)
Talon’s Output: Returns the entire email, including the quoted text starting with “On Mon, Jun 3…” and all “> quoted text” lines
Processing Time: 2.55ms
Issue: Signature placement before quotes confuses detection logic
Test Case 2: Inline Responses
Input:
Expected Output: Just the inline responses (“I will reply under this one” and “and under this.”)
Talon’s Output: Returns everything including “On Tue, Apr 29…” header and all quoted lines
Processing Time: 0.48ms
Issue: Interleaved inline responses not recognized as the reply pattern
Test Case 3: Gmail Forward HTML
Input:
Expected Output: Just “testblah” (the text before the forward marker)
Talon’s Output: Includes “---------- Forwarded message ----------” and the forwarded content
Processing Time: 3.41ms
Issue: HTML forward headers not removed by Gmail quote detection
Test Case 4: Thunderbird Forward HTML
Input:
Expected Output: Empty (no new content, just forward)
Talon’s Output: Includes “-------- Forwarded Message --------” and the forwarded content
Processing Time: 4.34ms
Issue: Thunderbird’s moz-forward-container class not recognized
Summary: 2 of the 4 failures are forwarded messages (Test Cases 3 and 4). Regular replies work with 98%+ accuracy.
Success Examples
Example 1: Simple Gmail Reply
Input:
Talon’s Output: Awesome! I haven't had another problem with it.
Processing Time: 0.2ms
What Worked: Standard “On [date] [name] wrote:” pattern detected, quote marker (>) recognized
Example 2: Outlook Reply with Separator
Input:
Talon’s Output: Outlook with a reply directly above line
Processing Time: 0.51ms
What Worked: Outlook separator line (underscores) and “From:”/“Sent:” headers detected as splitter
Example 3: HTML Outlook Reply
Input:
Talon’s Output: Reply
Processing Time: 4.02ms
What Worked: Outlook’s OLK_SRC_BODY_SECTION span ID detected and removed structurally
Performance vs Simpler Alternatives
Tradeoff: Talon is more comprehensive but slower than plain-text-only libraries
- Talon: 1.92ms average (with HTML support)
- email-reply-parser: 0.03ms average (plain text only)
Forwarded Messages
As shown in test results, forwarded messages (especially HTML) are challenging:
- Plain text forwards: Generally work well
- HTML forwards: May retain forward headers
- Workaround: Use plain text extraction or post-process to remove forward markers
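The post-processing workaround can be sketched as a small helper that truncates the extracted text at the first forward marker line. The pattern below is an assumption based on the two marker strings seen in the failed test cases; other clients and languages will need additional patterns. The name `drop_forwarded` is hypothetical.

```python
import re

# Matches marker lines such as "---------- Forwarded message ----------"
# and "-------- Forwarded Message --------" (the two failed HTML cases).
FORWARD_MARKER = re.compile(
    r"^[ \t]*-{2,}[ \t]*Forwarded message[ \t]*-{2,}[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def drop_forwarded(text: str) -> str:
    """Truncate text at the first forward marker line, if one is present."""
    match = FORWARD_MARKER.search(text)
    if match is None:
        return text
    return text[: match.start()].rstrip()
```

Run this on Talon's output rather than on the raw email, so that it only handles what the quote detection left behind.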
Error Handling
Always handle potential parsing failures:
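One possible shape for this, as a sketch rather than a prescribed pattern: wrap the extraction call, log any exception, and fall back to the original body so downstream code always receives a string. The wrapper `safe_extract` and its `extractor` parameter are hypothetical; by default it assumes `talon.quotations.extract_from` is available.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_extract(
    body: str,
    content_type: str = "text/plain",
    extractor: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Return the extracted reply, or the original body if extraction fails."""
    if not body:
        return ""
    try:
        if extractor is None:
            # Lazy import: a missing dependency is handled like any other failure.
            from talon import quotations

            extractor = quotations.extract_from
        reply = extractor(body, content_type)
    except Exception:
        logger.exception("Reply extraction failed; returning the original body")
        return body
    # An empty string is a valid result (for example, a pure forward).
    return reply if isinstance(reply, str) else body
```

Falling back to the full body trades cleanliness for safety: the consumer may see quoted history, but never loses the new message.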
Testing Recommendations
Always test with your specific email formats:
Test with real emails from your users’ actual email clients. Talon’s accuracy is based on diverse real-world samples, but your specific use case may have unique patterns.
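A small evaluation harness makes such testing repeatable. The sketch below is an assumption about how you might structure it: `evaluate` and the `Case` tuple are hypothetical names, and the extractor is passed in as a plain callable so the harness works with any extraction function.

```python
from typing import Callable, Iterable, List, Tuple

Case = Tuple[str, str, str]  # (case name, raw email body, expected reply)


def evaluate(
    extract: Callable[[str], str], cases: Iterable[Case]
) -> Tuple[float, List[str]]:
    """Return (accuracy, names of failed cases) for an extractor."""
    failed: List[str] = []
    total = 0
    for name, raw, expected in cases:
        total += 1
        try:
            ok = extract(raw).strip() == expected.strip()
        except Exception:
            ok = False  # a crash counts as a failed case
        if not ok:
            failed.append(name)
    accuracy = (total - len(failed)) / total if total else 0.0
    return accuracy, failed
```

Collect a handful of anonymized emails per client, store each with its expected reply, and rerun the harness whenever you upgrade the library or add post-processing.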
JavaScript Version
For TypeScript/JavaScript projects, use TalonJS, a JavaScript port of Talon with similar functionality.
Performance Comparison
TalonJS provides 90.6% accuracy with a slightly faster average processing time (1.88ms vs. 1.92ms), making it a good fit for JavaScript/TypeScript environments that want to avoid Python dependencies.
Quick Start
When to use TalonJS vs Python Talon:
- Use TalonJS if you’re building in TypeScript/JavaScript and 90.6% accuracy is sufficient
- Use Python Talon if you need the highest accuracy (93.8%) or are in a Python environment
- The 3.2 percentage-point accuracy difference is acceptable for most use cases
