Email Reply Extraction with Talon

Extract clean reply content from email threads using Talon library

Why Talon?

Email threads accumulate quoted replies that clutter the actual content. When processing emails programmatically, you need just the new message, not the entire conversation history.

Talon solves this problem by extracting clean reply content through sophisticated pattern matching and structural analysis.

Use Cases

  • AI Email Agents: Extract new user messages without processing entire thread history
  • Email Automation: Parse replies to identify actionable content
  • Thread Analysis: Build conversation flows by isolating individual contributions
  • Inbox Management: Process only new information from replies

Why Choose Talon?

HTML Email Support

Handles Gmail, Outlook, Apple Mail, Thunderbird HTML structures

High Accuracy

93.8% success rate across 64 real-world test cases

Multi-language

Supports English, Japanese, Swedish, Polish, Dutch, German

Fast Performance

1.92ms average processing time, 488 emails/second


How Talon Works

Talon uses two complementary approaches depending on email format:

Plain Text Processing (6-Stage Pipeline)

  1. Line Classification: Assigns markers to each line (‘t’=text, ‘m’=quote marker, ‘s’=splitter, ‘e’=empty)
  2. Pattern Matching: Applies regex to marker sequences to identify quoted blocks
  3. Content Extraction: Removes quoted lines and returns clean text

Recognizes patterns like:

  • Standard quote markers (>)
  • Reply headers (“On [date] [name] wrote:”)
  • Forward indicators (“-----Original Message-----“)

HTML Processing (8-Stage Pipeline)

  1. Structural Removal: Directly removes known quotation elements (Gmail divs, blockquotes, Outlook markup)
  2. Checkpoint Fallback: For non-standard HTML, maps elements to text lines, applies text patterns, removes corresponding HTML

Processing Systems

Quotation Removal (Primary)

  • Removes quoted replies from thread
  • No initialization required
  • Rule-based pattern matching

Getting Started

1

Install Talon

Install via pip with required workaround for Python 3.11+:

$pip install talon
2

Apply Python 3.11+ Workaround

Required fix for cchardet dependency:

1# Import workaround BEFORE importing talon
2import sys
3import chardet
4sys.modules['cchardet'] = chardet
5
6# Now safe to import talon
7import talon
8from talon import quotations

This workaround is required for Python 3.11+.

3

Extract Reply Content

Basic usage for plain text and HTML:

1from talon import quotations
2
3email = """Great work on the project!
4
5On Mon, Apr 11, 2011 at 6:54 PM, Bob wrote:
6> Can you review the document?
7> Need feedback by Friday.
8"""
9
10clean_reply = quotations.extract_from_plain(email)
11# Result: "Great work on the project!"

Performance & Accuracy

Talon has been tested on 64 real-world emails from various clients and languages.

Test Results Summary

MetricValue
Total Tests64 emails
Passed60 (93.8%)
Failed4 (6.2%)
Avg Processing Time1.92ms
Throughput488.6 emails/second
Min/Max Time0.13ms - 21.55ms

Test Coverage

  • 22 HTML emails: Gmail, Outlook, Apple Mail, Thunderbird, Mail.ru, Hotmail
  • 42 plain text emails: Various formats and reply styles
  • 6+ languages: English, Japanese, Swedish, Polish, Dutch, German
  • Mobile clients: iPhone, Android “Sent from” signatures

Processing Time by Complexity

Email TypeAvg TimeComplexity
Simple text reply0.2-0.5msLow
HTML Gmail/Outlook2-4msMedium
Complex threads4-22msHigh

Speed vs Accuracy Tradeoff

LibraryAvg Processing TimeAccuracyBest For
Talon1.92ms93.8%Production systems needing HTML support
qutoequail0.96ms~85%Moderate accuracy requirements
Custom regex0.1ms~70%Simple plain text, speed critical

Insight: For production systems, 1.92ms average is negligible. Even at worst case (21.55ms), Talon is faster than most network requests.


Known Limitations

Talon failed 4 out of 64 test cases. Here’s what didn’t work:

Test Case 1: Complex Email Thread with Mixed Content

Input:

Thank you, Sonya Johnson.
I have sent an invite for 10:30am Monday PDT (today). I
hope you can join.
Regards,
Christopher Edwards
On Mon, Jun 3, 2024 at 12:53 AM Cody Hart <omerritt@example.com> wrote:
> Hi Christopher Edwards,
>
> 10.30 AM pacific is good for me.
>
> Thanks & Regards,
>
> Cody Hart

Expected Output: First 5 lines only (up to Christopher Edwards)

Talon’s Output: Returns entire email including quoted text starting with “On Mon, Jun 3…” and all ”> quoted text”

Processing Time: 2.55ms

Issue: Signature placement before quotes confuses detection logic


Test Case 2: Inline Responses

Input:

On Tue, Apr 29, 2014 at 4:22 PM, Example Dev <sugar@example.com> wrote:
> okay. Well, here's some stuff I can write.
>
> And if I write a 2 second line you and maybe reply under this?
>
> Or if you didn't really feel like it, you could reply under this line.
I will reply under this one
>
> okay?
>
and under this.
>
> -- Tim

Expected Output: Just the inline responses (I will reply under this one and and under this.)

Talon’s Output: Returns everything including “On Tue, Apr 29…” header and all quoted lines

Processing Time: 0.48ms

Issue: Interleaved inline responses not recognized as the reply pattern


Test Case 3: Gmail Forward HTML

Input:

1<html><head></head><body><div dir="ltr">test<div><br /></div><div>blah</div>
2<div><br /><div class="gmail_quote">---------- Forwarded message ----------<br />
3From: <b class="gmail_sendername">Foo Bar</b>
4<span dir="ltr">&lt;<a href="mailto:foo@bar.example">foo@bar.example</a>&gt;</span><br />
5Date: Thu, Mar 24, 2016 at 5:17 PM<br />
6Subject: The Subject<br />
7To: John Doe &lt;<a href="mailto:john@doe.example">john@doe.example</a>&gt;<br />
8<br /><br /><div dir="ltr">Some text<div><br /></div><div><br /></div></div>
9</div><br /></div></div></body></html>

Expected Output: Just testblah (before the forward marker)

Talon’s Output: Includes ”---------- Forwarded message ----------” and forwarded content

Processing Time: 3.41ms

Issue: HTML forward headers not removed by Gmail quote detection


Test Case 4: Thunderbird Forward HTML

Input:

1<html><body bgcolor="#FFFFFF" text="#000000">
2<p><br /></p>
3<div class="moz-forward-container"><br /><br />
4-------- Forwarded Message --------
5<table class="moz-email-headers-table">
6 <tbody>
7 <tr><th>Subject:</th><td>Re: Example subject</td></tr>
8 <tr><th>Date:</th><td>Tue, 3 May 2016 14:54:27 +0200 (CEST)</td></tr>
9 <tr><th>From:</th><td>John Doe &lt;johndoe@example.com&gt;</td></tr>
10 </tbody>
11</table>
12<br /><br />
13<div>Dear John,</div>
14<div><br /></div>
15<div>This is a test.</div>
16</div></body></html>

Expected Output: Empty (no new content, just forward)

Talon’s Output: Includes ”-------- Forwarded Message --------” and forwarded content

Processing Time: 4.34ms

Issue: Thunderbird’s moz-forward-container class not recognized


Summary: 3 of 4 failures are forwarded messages. Regular replies work with 98%+ accuracy.

Example 1: Simple Gmail Reply

Input:

Awesome! I haven't had another problem with it.
On Aug 22, 2011, at 7:37 PM, defunkt<reply@reply.github.com> wrote:
> Loader seems to be working well.

Talon’s Output: Awesome! I haven't had another problem with it.

Processing Time: 0.2ms

What Worked: Standard “On [date] [name] wrote:” pattern detected, quote marker (>) recognized


Example 2: Outlook Reply with Separator

Input:

Outlook with a reply directly above line
________________________________________
From: CRM Comments [crm-comment@example.com]
Sent: Friday, 23 March 2012 5:08 p.m.
To: John S. Greene
Subject: [contact:106] John Greene
A new comment has been added to the Contact named 'John Greene':
I am replying to a comment.

Talon’s Output: Outlook with a reply directly above line

Processing Time: 0.51ms

What Worked: Outlook separator line (underscores) and “From:”/“Sent:” headers detected as splitter


Example 3: HTML Outlook Reply

Input:

1<html>
2 <body>
3 <div>Reply</div>
4 <span id="OLK_SRC_BODY_SECTION">
5 <div>
6 <span>From: </span>Bob &lt;<a href="mailto:bob@example.com">bob@example.com</a>&gt;<br />
7 <span>Date: </span>Tue, 01 Nov 2011 18:54:39 -0700<br />
8 <span>To: </span>Rob &lt;<a href="mailto:rob@example.com">rob@example.com</a>&gt;<br />
9 <span>Subject: </span>Test<br />
10 </div>
11 <div>Hi</div>
12 </span>
13 </body>
14</html>

Talon’s Output: Reply

Processing Time: 4.02ms

What Worked: Outlook’s OLK_SRC_BODY_SECTION span ID detected and removed structurally

Tradeoff: Talon is more comprehensive but slower than plain-text-only libraries

  • Talon: 1.92ms average (with HTML support)
  • email-reply-parser: 0.03ms average (plain text only)

For production systems, 1.92ms average is negligible. Even at worst case (21.55ms), Talon is faster than most network requests.

As shown in test results, forwarded messages (especially HTML) are challenging:

  • Plain text forwards: Generally work well
  • HTML forwards: May retain forward headers
  • Workaround: Use plain text extraction or post-process to remove forward markers

Error Handling

Always handle potential parsing failures:

1from talon import quotations
2
3def safe_extract(email_body, is_html=False):
4 try:
5 if is_html:
6 return quotations.extract_from_html(email_body)
7 else:
8 return quotations.extract_from_plain(email_body)
9 except Exception as e:
10 # Fallback to original message if extraction fails
11 print(f"Talon extraction failed: {e}")
12 return email_body

Testing Recommendations

Always test with your specific email formats:

1# Create a test suite with your actual email patterns (Gmail, Outlook, Apple Mail)
2test_emails = [
3 "path/to/gmail_reply.html",
4 "path/to/outlook_reply.txt",
5 "path/to/forward.html"
6]
7
8for email_file in test_emails:
9 with open(email_file) as f:
10 content = f.read()
11 result = quotations.extract_from(content)
12 print(f"{email_file}: {len(result)} chars extracted")

Test with real emails from your users’ actual email clients. Talon’s accuracy is based on diverse real-world samples, but your specific use case may have unique patterns.


JavaScript Version

For TypeScript/JavaScript projects, use TalonJS - a JavaScript port of Talon with similar functionality.

Performance Comparison

SolutionAccuracySpeedBest For
Python Talon93.8%1.92msHighest accuracy
TalonJS90.6%1.88msTypeScript/Node.js projects

TalonJS provides 90.6% accuracy with slightly faster performance (1.88ms), making it ideal for JavaScript/TypeScript environments without needing Python dependencies.

Quick Start

1

Install TalonJS

$npm install talonjs
2

Extract Replies

1import * as talon from 'talonjs';
2
3const email = `Great work on the project!
4
5On Mon, Apr 11, 2011 at 6:54 PM, Bob wrote:
6> Can you review the document?
7> Need feedback by Friday.
8`;
9
10const result = talon.quotations.extractFromPlain(email);
11const cleanReply = result.body.trim();
12// Output: "Great work on the project!"

When to use TalonJS vs Python Talon:

  • Use TalonJS if you’re building in TypeScript/JavaScript and 90.6% accuracy is sufficient
  • Use Python Talon if you need the highest accuracy (93.8%) or are in a Python environment
  • The 3.2% accuracy difference is acceptable for most use cases