Converting MHTML to HTML

I recently needed to convert an MHTML file attachment into HTML text so that it could be used as the body of a notification service email. Initially I was overwhelmed by MHTML’s encoding and document structure. Eventually I settled on the following relatively compact LINQ statement:

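using System;
using System.Linq;
using System.Text;

public static String ToHtml(String mhtml) {
    if (mhtml == null) throw new ArgumentNullException("mhtml");

    // Boundary prefix as it appeared in my MHTML; other producers may use a different one.
    const String partBoundary = "------=_NextPart";
    const String startToken = "Content-Transfer-Encoding: base64";
    const String htmlContentType = "Content-Type: text/html;";

    return mhtml.Split(new [] { partBoundary }, StringSplitOptions.RemoveEmptyEntries)
                // Keep only the Base64-encoded text/html parts.
                .Where(t => t.IndexOf(htmlContentType, StringComparison.Ordinal) >= 0)
                .Where(t => t.IndexOf(startToken, StringComparison.Ordinal) >= 0)
                // Drop the headers; assumes the encoding header is the last one before the payload.
                .Select(t => t.Substring(t.IndexOf(startToken, StringComparison.Ordinal)))
                .Select(t => t.Replace(startToken, String.Empty).Trim())
                // Decode Base64 to bytes, then to text. Encoding.Default is the system
                // code page on .NET Framework; use the part's charset if it differs.
                .Select(t => Encoding.Default.GetString(Convert.FromBase64String(t)))
                .FirstOrDefault();
}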

The idea is to separate the MHTML into its parts, as there can be several (more on that later). I then locate the text/html part I’m interested in and trim away the headers and boundary identifiers. This leaves me with a single Base64-encoded String, which needs to be converted into a Byte[] and finally into the human-readable HTML document.
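
To make the final decoding step concrete, here is a minimal sketch of it in isolation. The class name Base64Demo and the helper Decode are hypothetical names used only for illustration, and UTF-8 is an assumed charset; a real part declares its own charset in the Content-Type header.

```csharp
using System;
using System.Text;

public static class Base64Demo {
    // Hypothetical helper: decodes one Base64 payload to text using the given encoding.
    public static String Decode(String base64, Encoding encoding) {
        // Convert.FromBase64String ignores white space, so the line breaks
        // that wrap a MIME body do not need to be stripped first.
        Byte[] bytes = Convert.FromBase64String(base64);
        return encoding.GetString(bytes);
    }

    public static void Main() {
        // "<p>Hi</p>" encoded as Base64 and wrapped over two lines, as a MIME body might be.
        String payload = "PHA+SGk8\r\nL3A+";
        Console.WriteLine(Decode(payload, Encoding.UTF8)); // prints <p>Hi</p>
    }
}
```

If the decoded text shows garbled accented characters, the encoding passed to Decode most likely does not match the charset the part was written with.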

As I alluded to earlier, MHTML can be (and often is) composed of more than a single part. These parts can be anything that the HTML references — including images. The solution above does not handle image data; it does not even make the attempt. For my purposes this was fine. You may find it unacceptable in other scenarios.

If someone were to extend this solution to be more general-purpose, I would recommend the same approach as above. Instead of discarding the non-HTML parts, you would want to loop through the set created by Split() and handle each part according to its Content-Type.
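
A rough sketch of what that loop might look like follows. It is not something I have used in production. The names MhtmlParts, Part, and Parse are hypothetical, and it rests on several assumptions: every interesting part is Base64-encoded, the HTML is UTF-8, and images are referenced in the HTML by the exact value of their Content-Location header (references using cid: are not handled). Images are inlined as data: URIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MhtmlParts {
    // Hypothetical container for one decoded part.
    public sealed class Part {
        public String MediaType;  // e.g. "text/html" or "image/png"
        public String Location;   // Content-Location value, or null if absent
        public Byte[] Data;       // decoded payload
    }

    public static List<Part> Parse(String mhtml) {
        if (mhtml == null) throw new ArgumentNullException("mhtml");
        const String partBoundary = "------=_NextPart";
        var parts = new List<Part>();

        foreach (String raw in mhtml.Split(new[] { partBoundary }, StringSplitOptions.RemoveEmptyEntries)) {
            // Normalise line endings so headers and body are separated by "\n\n".
            String text = raw.Replace("\r\n", "\n");
            int blank = text.IndexOf("\n\n", StringComparison.Ordinal);
            if (blank < 0) continue; // no body, e.g. the closing boundary

            // Header names are case-insensitive; folded continuation lines are skipped.
            var headers = new Dictionary<String, String>(StringComparer.OrdinalIgnoreCase);
            foreach (String line in text.Substring(0, blank).Split('\n')) {
                int colon = line.IndexOf(':');
                if (colon > 0 && !Char.IsWhiteSpace(line[0]))
                    headers[line.Substring(0, colon).Trim()] = line.Substring(colon + 1).Trim();
            }

            String encoding, contentType, location;
            if (!headers.TryGetValue("Content-Transfer-Encoding", out encoding) ||
                !encoding.Equals("base64", StringComparison.OrdinalIgnoreCase)) continue;
            if (!headers.TryGetValue("Content-Type", out contentType)) continue;
            headers.TryGetValue("Content-Location", out location);

            parts.Add(new Part {
                MediaType = contentType.Split(';')[0].Trim().ToLowerInvariant(),
                Location = location,
                Data = Convert.FromBase64String(text.Substring(blank + 2))
            });
        }
        return parts;
    }

    public static String ToHtml(String mhtml) {
        String html = null;
        var images = new Dictionary<String, String>();

        // Handle each part according to its Content-Type.
        foreach (Part part in Parse(mhtml)) {
            if (part.MediaType == "text/html" && html == null) {
                html = Encoding.UTF8.GetString(part.Data); // assumes UTF-8
            } else if (part.MediaType.StartsWith("image/", StringComparison.Ordinal)
                       && !String.IsNullOrEmpty(part.Location)) {
                images[part.Location] =
                    "data:" + part.MediaType + ";base64," + Convert.ToBase64String(part.Data);
            }
        }
        if (html == null) return null;

        // Swap each image reference for its inline data: URI.
        foreach (var image in images) html = html.Replace(image.Key, image.Value);
        return html;
    }
}
```

Parsing the headers up to the first blank line, rather than searching for the encoding header alone, also means a part still decodes when other headers such as Content-Location follow Content-Transfer-Encoding.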
