How to Clean Up HTML from Microsoft Word Before Publishing to Web or CMS
Struggling with junk HTML from Word? You're not alone. Whether you're pasting into WordPress, sending emails, or managing CMS content, Word adds an invisible mess behind the scenes. This guide shows you how to clean it all up – fast, reliably, and at scale.
Table of Contents
- Why Word HTML Is a Problem
- Real Examples of Bad Word HTML
- Why Cleaning HTML Matters (SEO & Accessibility)
- Common Cleaning Methods (and Why They Fail)
- The Puritext Approach (Fast & Accurate)
- Before & After Cleanup
- Puritext vs. Other Tools
- Bonus: API for Developers
- FAQ
Why Word HTML Is a Problem
Microsoft Word is great for writing, but terrible for the web. When you paste from Word, you're not just pasting text – you're injecting dozens of hidden styles, classes, and legacy Office formatting like:
- <span style="mso-bidi-font-style:italic"> and <o:p></o:p> tags
- Hard-coded line heights, colors, and font families
- Useless nested <div> or <span> wrappers
- Invisible symbols, non-breaking spaces, junk characters
Real Examples of Bad Word HTML
Here’s what a simple paragraph from Word can turn into:
<p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height:107%">
<span style="font-family:'Calibri',sans-serif">This is from Word.<o:p></o:p></span></p>
And here’s how it should look:
<p>This is from Word.</p>
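To make the transformation above concrete, here is a minimal sketch of allowlist-based cleaning using only Python's standard library. It is an illustration of the general technique, not a description of how Puritext works internally. The `KEEP` set, the `WordHtmlCleaner` class, and the `clean_word_html` function are names invented for this example, and the choice of tags to keep is an assumption you would adjust for your own CMS.

```python
from html import escape
from html.parser import HTMLParser

# Tags worth keeping; anything else (span, div, o:p, font, ...) is unwrapped.
# This allowlist is an illustrative assumption, not a canonical set.
KEEP = {"p", "ul", "ol", "li", "strong", "em", "a", "h1", "h2", "h3", "br"}
# Elements whose text content should be discarded entirely.
SKIP_CONTENT = {"style", "script"}


class WordHtmlCleaner(HTMLParser):
    """Rebuild markup, keeping only allowlisted tags and the href on links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip_depth += 1
            return
        if tag not in KEEP:
            return  # drop the wrapper tag but keep its text content
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.out.append(f'<a href="{escape(href, quote=True)}">')
        else:
            self.out.append(f"<{tag}>")  # all other attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in KEEP and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth:
            return
        # Normalise non-breaking spaces to plain spaces, then re-escape text.
        self.out.append(escape(data.replace("\u00a0", " "), quote=False))


def clean_word_html(html: str) -> str:
    parser = WordHtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_word_html` on a single-line version of the Word paragraph above drops the `class` and `style` attributes and unwraps the `<span>` and `<o:p>` elements. A sketch like this does not repair broken nesting or remove empty paragraphs, so treat it as a starting point rather than a complete cleaner.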
Why Cleaning HTML Matters (SEO & Accessibility)
- Leaner markup means smaller pages and faster loads (page speed is one of the signals Google considers)
- Semantically correct HTML improves accessibility and screen-reader support
- Lower risk of design-breaking bugs caused by rogue inline styles
Common Cleaning Methods (and Why They Fail)
- Notepad: removes all formatting – including useful tags like <strong> or <em>
- Online HTML Cleaners: often break links, escape valid tags, or don’t support Word artifacts
- Manual cleaning: slow, error-prone, impossible to scale
The Puritext Approach (Fast & Accurate)
Puritext is built specifically to clean up pasted content while preserving structure. Key features:
- Preserve <p>, <ul>, <strong>, <a>
- Remove MS Office tags, inline styles, symbols, emojis
- Supports output formats: plain, HTML, CMS, email, markdown
- Available as web app or via API
Before & After Cleanup
Input:
<p style="font-family:Calibri">Hello from Word! 😬</p>
Cleaned with Puritext:
<p>Hello from Word!</p>
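The emoji half of that cleanup can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not Puritext's implementation: the code point ranges are a rough, non-exhaustive selection (they miss flag sequences, for example, and will also remove dingbats such as check marks), and `strip_emoji` is a hypothetical helper name.

```python
import re

# Rough, non-exhaustive emoji ranges chosen for illustration only.
_EMOJI_CLASS = (
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]"
)

# Match a run of emoji together with any spaces or tabs directly before it,
# so removing the emoji does not leave a dangling space behind.
_EMOJI_RUN = re.compile(r"[ \t]*" + _EMOJI_CLASS + "+")


def strip_emoji(text: str) -> str:
    """Remove emoji runs (and the whitespace preceding them) from text."""
    return _EMOJI_RUN.sub("", text)
```

Because only the whitespace before an emoji is removed, an emoji at the very start of a string leaves the following space in place; trim the result if that matters for your content.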
Puritext vs. Other Tools
Feature | Puritext | Notepad | WordHTMLCleaner |
---|---|---|---|
Preserve useful HTML | ✔ | ✖ | ✔ |
Remove Word-specific styles | ✔ | ✖ | ~ |
Custom format output | ✔ | ✖ | ✖ |
API support | ✔ | ✖ | ✖ |
Bonus: API for Developers
Automate your cleanup by calling the Puritext API:
curl -X POST https://puritext.com/api/clean \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "<p>Pasted from Word</p>",
    "format": "cms",
    "remove_html": false,
    "remove_emoji": true
  }'
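If you would rather script this in Python, the same call can be sketched with the standard library. The endpoint, headers, and JSON fields below are copied from the curl example; the function names `build_clean_request` and `clean_via_api` are hypothetical, and since the response format is not described here, the sketch returns the raw response body instead of guessing at its fields.

```python
import json
import urllib.request

API_URL = "https://puritext.com/api/clean"  # endpoint as given in the curl example


def build_clean_request(html: str, token: str, fmt: str = "cms") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "text": html,
        "format": fmt,
        "remove_html": False,
        "remove_emoji": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def clean_via_api(html: str, token: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text.

    The response schema is an unknown here, so nothing is parsed out of it.
    """
    request = build_clean_request(html, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before it leaves your machine, and to loop `clean_via_api` over a list of posts when cleaning in bulk.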
FAQ
Can I clean HTML without losing <p> and <strong> tags?
Yes, Puritext removes junk but keeps useful semantic tags intact.
Does it work with content from Outlook emails?
Yes. Outlook produces similarly messy HTML, and Puritext handles it just as well.
Can I batch clean content via API?
Absolutely. You can clean hundreds of posts via script using the API.
Want to try it yourself? Try Puritext online or explore the API docs.