Text Cleaner
Tidy up messy text in one pass — remove duplicate and blank lines, trim and collapse whitespace, strip HTML tags, sort or reverse lines, convert tabs and spaces, add a prefix or suffix and number lines. Pick the operations you need and the cleaned result updates live. Everything runs in your browser — nothing is uploaded.
What each operation does
| Operation | What it does | Example |
|---|---|---|
| Trim lines | Removes leading and trailing spaces and tabs from every line. | "␣␣hello␣␣" → "hello" |
| Collapse spaces | Replaces runs of spaces/tabs inside a line with a single space. | "a␣␣␣␣b" → "a␣b" |
| Remove blank lines | Deletes lines that are empty or contain only whitespace. | drops the empty rows |
| Remove duplicate lines | Keeps the first occurrence of each line, drops later repeats. | apple / apple → apple |
| Strip HTML tags | Removes <…> markup, leaving just the text content. | "<b>hi</b>" → "hi" |
| Tabs ↔ spaces | Converts each tab to N spaces, or every N spaces to a tab. | tab → 4 spaces |
| Sort / reorder lines | Sorts A→Z or Z→A (numeric-aware) or reverses the current order. | file2 before file10 |
| Prefix / suffix | Adds your text to the start and/or end of every line. | "- " + line |
| Number lines | Prepends a 1. 2. 3. … counter to each line. | "1. first" |
| Join lines | Combines all lines using a newline, space, comma or nothing. | a / b → "a, b" |
How it works
The Text Cleaner applies your chosen operations in one fixed, predictable pipeline so the same input and settings always produce the same result. First it optionally strips HTML tags, then it works line by line (converting tabs/spaces, collapsing runs of spaces, and trimming the edges), then it removes blank and duplicate lines, sorts or reverses, adds any prefix/suffix, numbers the lines, and finally joins them back together with your chosen separator. Nothing is sent to a server — the whole thing is plain JavaScript running on your device, so it works offline and is safe for private text.
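The fixed order described above can be sketched as a single function. This is a minimal illustration, not the tool's actual source: the function name `cleanText` and every option name (`stripHtml`, `tabSize`, `collapseSpaces` and so on) are hypothetical, and the tag-stripping regex is deliberately naive.

```javascript
// Hypothetical sketch of the fixed pipeline; all option names are assumptions.
function cleanText(input, opts = {}) {
  // Normalise CRLF and lone CR to LF before anything else.
  let text = input.replace(/\r\n?/g, "\n");

  // 1) Strip HTML tags (naive regex: fine for snippets, not a real HTML parser).
  if (opts.stripHtml) text = text.replace(/<[^>]*>/g, "");

  let lines = text.split("\n");

  // 2) Per-line work: convert tabs, collapse runs of spaces/tabs, then trim.
  lines = lines.map((line) => {
    if (opts.tabSize) line = line.replace(/\t/g, " ".repeat(opts.tabSize));
    if (opts.collapseSpaces) line = line.replace(/[ \t]+/g, " ");
    if (opts.trim) line = line.replace(/^[ \t]+|[ \t]+$/g, "");
    return line;
  });

  // 3) Remove lines that are empty or whitespace-only.
  if (opts.removeBlank) lines = lines.filter((line) => line.trim() !== "");

  // 4) Remove duplicates; a Set keeps the first occurrence and preserves order.
  if (opts.removeDuplicates) lines = [...new Set(lines)];

  // 5) Sort (numeric-aware) or reverse.
  if (opts.sort === "asc" || opts.sort === "desc") {
    const collator = new Intl.Collator(undefined, { numeric: true });
    lines.sort(collator.compare);
    if (opts.sort === "desc") lines.reverse();
  } else if (opts.sort === "reverse") {
    lines.reverse();
  }

  // 6) Prefix and suffix.
  lines = lines.map((line) => (opts.prefix ?? "") + line + (opts.suffix ?? ""));

  // 7) Number the lines.
  if (opts.numberLines) lines = lines.map((line, i) => `${i + 1}. ${line}`);

  // 8) Join with the chosen separator (newline by default).
  return lines.join(opts.separator ?? "\n");
}
```

Because each step feeds the next, trimming happens before de-duplication, so `"  apple"` and `"apple"` count as the same line once `trim` is on.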
Common uses
- De-duplicate and sort a list of emails, keywords, URLs or IDs.
- Clean up text pasted from a PDF, spreadsheet or chat (extra spaces and blank lines).
- Strip HTML tags out of a copied web snippet to get plain text.
- Turn a column of values into a comma-separated list (or back into lines).
- Add bullet markers, quotes or commas to every line, or number a list.
- Normalise indentation by converting tabs to spaces (or the reverse).
Why pasted text is full of invisible junk
Text rarely arrives clean. When you copy from Microsoft Word, Google Docs, a PDF, an email
client or a web page, the visible words come with a layer of formatting characters you never
see. Word processors silently insert non-breaking spaces, curly quotes and soft hyphens; web
pages contribute HTML entities and stray zero-width characters used for
layout; and PDFs are the worst offenders of all, because their text is really a set of
positioned glyphs. Copying from a PDF often yields hard line breaks in the middle of
sentences, hyphenation left over from justified columns, ligatures such as ﬁ (U+FB01) in
place of "fi", and runs of spaces standing in for table columns. Paste any of this into a
code editor, a spreadsheet, a database field or a search box and it can break in ways that
are maddening to debug, precisely because the culprit is invisible. Cleaning text is largely
the art of stripping that hidden layer back to plain, predictable characters.
Line endings: LF, CRLF and the old Mac CR
Every line of text ends with an invisible control character, and three different conventions
are still in active use. Unix, Linux, macOS and the modern web use a single line feed,
LF (\n, U+000A). Windows and DOS use a carriage return followed by
a line feed, CRLF (\r\n, U+000D U+000A) — the same pair that HTTP
headers and many internet protocols mandate. Classic Mac OS, up to version 9, used a lone
carriage return, CR (\r). Mixing them causes real problems: a file
with stray carriage returns shows up as ^M in diffs and some editors, breaks
shell scripts with "bad interpreter" or "command not found" errors, and can make two
visually identical files compare as different in version control. This cleaner normalises all
three conventions to a single line feed before it does anything else, so the rest of the
pipeline can treat every line consistently.
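The normalisation step can be written in one line of JavaScript. The sketch below is an assumption about how such a step might look rather than the tool's own code; `normaliseLineEndings` and `countLineEndings` are illustrative names.

```javascript
// Convert CRLF and lone CR to LF. The optional \n means a CRLF pair is
// replaced as one unit, so it never turns into two line feeds.
function normaliseLineEndings(text) {
  return text.replace(/\r\n?/g, "\n");
}

// Report which conventions a text contains, useful for spotting mixed files.
function countLineEndings(text) {
  const crlf = (text.match(/\r\n/g) || []).length;
  const cr = (text.match(/\r(?!\n)/g) || []).length;   // CR not followed by LF
  const lf = (text.match(/(?<!\r)\n/g) || []).length;  // LF not preceded by CR
  return { crlf, cr, lf };
}
```

Running `countLineEndings` before normalising is a quick way to confirm that a file really does mix conventions.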
Invisible and look-alike characters that break things
A handful of Unicode characters are either completely invisible or visually identical to an ordinary space, yet they are not the characters a computer expects. Because you cannot see them, they survive copy-paste and quietly sabotage exact-match search, spreadsheet lookups, CSV imports, form validation and string comparison in code — the word "total" and the word "total" with a hidden zero-width space on the end are simply not equal.
| Character | Code point | Why it causes trouble |
|---|---|---|
| Zero-width space | U+200B | Adds an invisible line-break opportunity; breaks search and matching, splits identifiers. |
| Zero-width non-joiner | U+200C | Controls letter joining in Arabic and Indic scripts; stray copies corrupt lookups. |
| Zero-width joiner | U+200D | Glues emoji and conjuncts together; loose ones leave unmatchable strings. |
| Byte-order mark / ZWNBSP | U+FEFF | Valid only at the very start of a file; mid-text it is a phantom character — the classic reason a CSV's first header never matches. |
| Non-breaking space | U+00A0 | Looks like a space but is not an ASCII space; defeats naïve trimming and word splitting (often pasted from HTML or Word). |
| Soft hyphen | U+00AD | Invisible unless a line wraps there; pollutes copied text and breaks exact search. |
Most of these arrive from web pages and word processors that use them for typesetting. Stripping them out or replacing them with an ordinary space is one of the most valuable things a cleaner can do, because the symptom — "my lookup says these two cells differ when they look the same" — gives you almost no clue where to look.
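A minimal sketch of how the characters in the table could be removed or made visible, assuming plain JavaScript and using `\u` escapes so the source itself contains nothing invisible. The function names are hypothetical. Note that deleting U+200C and U+200D is only safe for machine-bound text: in Arabic or Indic prose and in emoji sequences they carry meaning.

```javascript
// Characters that are deleted outright: ZWSP, ZWNJ, ZWJ, BOM/ZWNBSP, soft hyphen.
const INVISIBLE = /[\u200B\u200C\u200D\uFEFF\u00AD]/g;
// The non-breaking space is replaced by an ordinary space rather than deleted.
const NBSP = /\u00A0/g;

function stripInvisible(text) {
  return text.replace(INVISIBLE, "").replace(NBSP, " ");
}

// Replace each hidden character with a visible <U+XXXX> marker for debugging.
function revealInvisible(text) {
  return text.replace(/[\u200B\u200C\u200D\uFEFF\u00AD\u00A0]/g, (ch) => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    return `<U+${hex}>`;
  });
}
```

For example, `revealInvisible` applied to a CSV header that starts with a byte-order mark shows `<U+FEFF>` in front of the first column name, which explains why the lookup on that header fails.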
Smart quotes and dashes vs their ASCII twins
Word, Google Docs and many publishing systems automatically "improve" your typography as you type: straight quotes become curly ones and double hyphens become dashes. For prose this looks polished, but for anything a machine has to parse it is a frequent source of cryptic failures. The straight apostrophe and quotation mark on your keyboard are ASCII U+0027 and U+0022; their typographic replacements are the left and right single quotes U+2018 and U+2019 and the left and right double quotes U+201C and U+201D. Likewise the plain hyphen-minus U+002D is often swapped for an en dash U+2013 or an em dash U+2014.
The trouble is that these look-alikes are not interchangeable in code. JSON, for instance, requires the straight double quote U+0022 around every key and string value; paste a curly quote in its place and the parser rejects the whole document. An em dash dropped into a configuration file, a CSV, a URL or a database identifier usually survives a visual review — it reads as a hyphen — but fails the moment software treats it literally. The fix is to convert smart punctuation back to its ASCII equivalent before the text reaches code, or to compose machine-bound text in a plain editor in the first place. Replacing curly quotes with straight quotes, and en or em dashes with a plain hyphen, is a routine, safe normalisation for any text headed into a program.
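The conversion described above can be sketched as three replacements. This is an illustrative helper (the name `asciiPunctuation` is an assumption), covering only the code points named in this section.

```javascript
// Map curly quotes and en/em dashes back to their ASCII equivalents.
function asciiPunctuation(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'")   // left/right single quotes -> U+0027
    .replace(/[\u201C\u201D]/g, '"')   // left/right double quotes -> U+0022
    .replace(/[\u2013\u2014]/g, "-");  // en dash, em dash -> U+002D
}

// A JSON document pasted from a word processor, with curly double quotes.
const pasted = "{\u201Cname\u201D: \u201CText Cleaner\u201D}";

// JSON.parse(pasted) would throw a SyntaxError; the cleaned text parses.
const parsed = JSON.parse(asciiPunctuation(pasted));
console.log(parsed.name);
```

Apply this only to text headed into a program; in prose meant for readers the typographic characters are usually the correct ones.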
Unicode normalization: NFC, NFD, NFKC and NFKD
The same visible text can be stored as different sequences of code points. The accented letter é can be a single precomposed character (U+00E9) or a plain "e" followed by a combining acute accent (U+0301). They look identical and mean the same thing, but byte for byte they are different strings, so a search or comparison can miss one while matching the other. Unicode normalization, defined by Unicode Standard Annex #15, rewrites text into one of four normalization forms so that equivalent strings become identical.
| Form | Name | What it does |
|---|---|---|
| NFC | Canonical composition | Decomposes, then recomposes to precomposed characters. The best default for storing and sending text. |
| NFD | Canonical decomposition | Splits characters into a base plus combining marks. Handy for internal processing such as stripping accents. |
| NFKC | Compatibility composition | Also folds compatibility variants — ligatures, superscripts, full-width forms — to plain equivalents, then composes. Preferred for identifiers and security checks. |
| NFKD | Compatibility decomposition | The fully decomposed compatibility form, used mainly for matching and indexing. |
The "K" (compatibility) forms are powerful but lossy: NFKC turns the ﬁ ligature (U+FB01) into "fi", the full-width Ａ (U+FF21) into a normal "A" and the superscript ² into "2". That is exactly what you want for loose matching and search, but it erases formatting distinctions, so the Unicode Consortium warns against applying NFKC or NFKD blindly to arbitrary text. For most cleaning jobs NFC is the safe choice; reach for NFKC only when you specifically want those compatibility characters flattened.
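In JavaScript all four forms are available through the built-in `String.prototype.normalize`. The short sketch below shows the behaviour described above; the helper names `sameText` and `stripAccents` are illustrative assumptions.

```javascript
const composed = "\u00E9";     // é as one precomposed code point
const decomposed = "e\u0301";  // e followed by a combining acute accent

// Compare two strings after bringing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Strip accents: decompose with NFD, drop combining marks, recompose.
function stripAccents(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "").normalize("NFC");
}

console.log(composed === decomposed);      // false: different code points
console.log(sameText(composed, decomposed)); // true after normalization
console.log("\uFB01".normalize("NFKC"));   // "fi": ligature flattened
console.log("\uFB01".normalize("NFC"));    // ligature kept: NFC is not lossy here
console.log(stripAccents("caf\u00E9"));    // "cafe"
```

Normalising both sides of a comparison, as `sameText` does, avoids having to know which form the incoming text was stored in.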
Homoglyphs and Unicode spoofing
Some characters are not invisible but identical-looking, and that is a security problem as
much as a tidiness one. The Latin, Cyrillic and Greek alphabets each contain letters that
render the same in almost every font: Latin "a" (U+0061) and Cyrillic "а" (U+0430) are
indistinguishable on screen, as are Latin "o" (U+006F) and Cyrillic "о" (U+043E).
Substituting one for the other is called a homoglyph attack. In its
best-known form, an internationalized-domain-name homograph attack, a registrant builds a web
address out of look-alike characters so that a malicious site appears to read
apple.com while actually being a different domain encoded in Punycode (the
xn-- prefix you sometimes see). A widely cited 2017 demonstration registered such
a domain that some browsers displayed as "apple.com".
The same trick shows up far from the address bar: a Cyrillic letter hidden in a username, a coupon code, a spreadsheet key or a source-code identifier can let two "identical" values slip past validation or quietly fail to match. Restricting text to a known alphabet, or at least flagging mixed-script content, is the defence — and the first step is always to make those hidden characters visible, which is exactly what cleaning the text lets you do.
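Flagging mixed-script content can be sketched with Unicode property escapes, which modern JavaScript engines support in regular expressions with the `u` flag. This is a simplified illustration covering only the three alphabets named above, not a complete confusable-detection scheme; `scriptsUsed` and `isMixedScript` are hypothetical names.

```javascript
// One pattern per script of interest. Digits and punctuation belong to the
// "Common" script and therefore match none of these.
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

// Return the names of the scripts that appear in the text, in first-seen order.
function scriptsUsed(text) {
  const found = new Set();
  for (const ch of text) {
    for (const [name, pattern] of Object.entries(SCRIPTS)) {
      if (pattern.test(ch)) found.add(name);
    }
  }
  return [...found];
}

// A value that mixes two or more of these scripts deserves a closer look.
function isMixedScript(text) {
  return scriptsUsed(text).length > 1;
}

console.log(isMixedScript("apple.com"));       // false: Latin only
console.log(isMixedScript("\u0430pple.com"));  // true: Cyrillic "а" + Latin
```

A check like this cannot catch a spoof written entirely in one non-Latin script, so it is a flag for review rather than proof that a value is safe.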
Choosing which clean-up to apply
Cleaning is not one operation but a toolbox, and the right choice depends entirely on where the text is heading. A transformation that rescues a messy list can quietly corrupt a code file, so it pays to match the operation to the destination.
- Strip HTML tags when you want the readable text out of a copied web snippet or an email signature. It is the wrong move if you actually need the markup — stripping tags from a template or an email body throws away the very structure you were trying to keep.
- Remove emoji and symbols for fields that must stay machine-clean: usernames, file names, product SKUs, CSV keys and URLs. For ordinary prose, social captions or chat logs, removing them usually destroys meaning rather than tidying it.
- Collapse blank lines — reducing several empty rows to one — to tidy prose pasted from a PDF or email, where stray paragraph breaks pile up. Be careful with formats where blank lines are meaningful, such as Markdown (paragraph separators) or fixed-width data.
- Collapse runs of spaces to repair text where tabs or alignment spacing have been flattened into long gaps. Never apply it to source code, where indentation is significant, or to anything column-aligned by spaces.
- Case folding — forcing everything to lower or upper case — is for building case-insensitive comparison keys and de-duplicating lists, not for display text, where it would wreck proper nouns, acronyms and sentence capitalisation.
The golden rule is that whitespace and case are significant in many machine formats. Python and YAML depend on exact indentation, Makefiles require real tab characters, and Markdown treats blank lines and leading spaces as structure. When the text is destined for one of those, clean the invisible and look-alike characters that genuinely cause bugs — zero-width characters, non-breaking spaces, smart quotes, mismatched line endings — but leave the meaningful whitespace alone. Because this tool applies its steps in a fixed order and shows the result instantly, you can toggle one operation at a time and watch exactly what each change does before you trust the output.
Frequently asked questions
- What does the Text Cleaner do?
- It bundles the most common line and whitespace clean-up jobs into one pass: removing duplicate and blank lines, trimming and collapsing whitespace, stripping HTML tags, sorting or reversing lines, converting tabs and spaces, adding a prefix or suffix, numbering lines and joining everything together. Turn on the operations you need and the cleaned result updates instantly.
- In what order are the operations applied?
- Always the same predictable order so the result is repeatable: 1) strip HTML, 2) per-line — convert tabs/spaces, collapse spaces, then trim, 3) remove blank lines, 4) remove duplicates, 5) sort or reverse, 6) add prefix/suffix, 7) number lines, 8) join. Because the order is fixed, the same input and the same settings always give the same output.
- How does “Remove duplicate lines” decide what is a duplicate?
- It compares the whole line after the earlier steps have run (so trimming and collapsing happen first). The first time a line appears it is kept; any later identical line is dropped, preserving order. Tick “Case-insensitive” to treat “Apple” and “apple” as the same line.
- Will it change the order of my lines?
- No — unless you choose a Sort option or “Reverse order”. With sorting set to “None”, every operation preserves the original line order (duplicates are removed in place, blanks are removed in place).
- Does sorting handle numbers correctly?
- Yes. Sorting is numeric-aware, so “file2” comes before “file10” instead of the plain-text order that would put “file10” first. Combine it with “Case-insensitive” to ignore capitalisation when sorting.
- Is my text uploaded anywhere?
- No. Every operation runs entirely in your browser using JavaScript — your text never leaves your device. You can use the tool offline and on sensitive content with no privacy risk.