Chinese Text Cleaner (中文文本清洗)

Chinese text cleaner. 8 toggleable rules: collapse whitespace, remove inter-CJK spaces, normalize line endings, collapse blank lines, remove zero-width chars, normalize smart quotes, strip emoji, strip URLs.

RT-CHN-040 · Converters & Units

How to use

Toggle rules

8 rules, each toggleable. The first 5 are on by default; the last 3 (quotes, emoji, URLs) are off until you enable them.

Paste text

Cleaning is live. The stats bar shows before/after char counts and how many you saved.

Copy the result

One-click copy of the cleaned text.

Chain with other tools

This tool + simp/trad converter + full/half-width converter = complete Chinese text preprocessing pipeline.

Chinese Text Cleaning: Common Issues from OCR, Clipboard, and Scrapers

Chinese text often picks up "dirty data" when it crosses tool boundaries: extra spaces, zero-width characters (commonly injected by Word and PDF copy/paste), smart quotes, mixed full/half-width punctuation, and more. These problems are invisible on screen but break formatting downstream.

Common sources + cleaning strategies

(1) OCR output: spaces erroneously inserted between Chinese characters (since engines recognise character-by-character). Rule: enable "Remove inter-CJK spaces".
(2) PDF / Word clipboard: often carries zero-width characters (U+200B, U+FEFF, etc.) making documents "look identical but encode differently". Rule: enable "Remove zero-width chars".
(3) Smart quotes: Word auto-converts straight "..." to curly “...”. Curly quotes break JSON, SQL, and HTML. Rule: enable "Normalize smart quotes".
(4) Web scrapers: often pick up extra whitespace, URLs, emoji. Rule: enable the relevant filter.
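
For scraped Chinese text, URL stripping has one wrinkle: a URL is often glued directly to the following Chinese characters, so a pattern that runs "until whitespace" eats real text. The sketch below is an assumption about how such a filter could work, not the tool's actual pattern; `strip_urls` and the character ranges chosen as stop characters are hypothetical.

```python
import re

# Stop at whitespace, CJK ideographs, CJK punctuation, or full-width forms,
# so Chinese text glued to the end of a URL is not swallowed.
URL = re.compile(
    "https?://[^\\s\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]+"
)

def strip_urls(text: str) -> str:
    return URL.sub("", text)

sample = "详情见https://example.com/a?b=1，谢谢"

naive = re.sub(r"https?://\S+", "", sample)
careful = strip_urls(sample)

print(naive)    # 详情见
print(careful)  # 详情见，谢谢
```

The trade-off is that a URL containing unencoded Chinese characters would be cut short, so this is a heuristic rather than a full URL parser.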

Order of operations matters

The tool processes in this order: normalize line endings → strip zero-width → strip URLs → strip emoji → normalize quotes → remove inter-CJK spaces → collapse whitespace → collapse blank lines. The sequence is designed so each rule receives a "clean intermediate state" — standard practice in rules-based text processing.
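
Why the order matters can be shown with two of the rules. This is a minimal sketch, not the tool's code: the function names and the choice to treat the full-width space as removable are assumptions made for illustration.

```python
import re

ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")

# A run of spaces with a CJK character on each side. Lookarounds keep the
# characters themselves out of the match, so only the spaces are removed.
CJK = "\u4e00-\u9fff\u3400-\u4dbf"
INTER_CJK_SPACE = re.compile(f"(?<=[{CJK}])[ \u3000]+(?=[{CJK}])")

def strip_zero_width(text: str) -> str:
    return ZERO_WIDTH.sub("", text)

def remove_inter_cjk_spaces(text: str) -> str:
    return INTER_CJK_SPACE.sub("", text)

# A zero-width space hides between 汉 and the visible space.
dirty = "汉\u200b 字"

wrong_order = strip_zero_width(remove_inter_cjk_spaces(dirty))
right_order = remove_inter_cjk_spaces(strip_zero_width(dirty))

print(repr(wrong_order))  # '汉 字'
print(repr(right_order))  # '汉字'
```

In the wrong order the space is preceded by U+200B rather than a CJK character, so the space rule never fires. Stripping zero-width characters first gives the later rule the clean input it expects.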

Privacy

All processing happens in your browser. No text is ever sent to our servers. Safe for confidential content.

10 Facts about Chinese Text Cleaning

01

Zero-width characters (U+200B, U+200C, U+200D, U+FEFF) take up no visible space, so you cannot spot them by eye. Word, PDF, and Notion frequently inject them on copy.
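
Since these characters cannot be seen, it helps to list them explicitly before deciding to strip them. A small sketch, where `find_zero_width` is a hypothetical helper name:

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def find_zero_width(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each zero-width character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]

clean = "中文"
pasted = "中\u200b文"  # looks identical on screen

print(clean == pasted)          # False
print(len(clean), len(pasted))  # 2 3
print(find_zero_width(pasted))  # [(1, 'U+200B', 'ZERO WIDTH SPACE')]
```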

02

Smart quotes (curly “ ” and ‘ ’) are Word's default: typing the straight " key gets auto-converted. They trigger syntax errors in JSON, SQL, and command-line tools.
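
The JSON failure is easy to reproduce. The sketch below assumes a simple one-to-one mapping of the four common curly quotes; `normalize_quotes` is a hypothetical name, and the tool may cover more quote characters than these.

```python
import json

QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def normalize_quotes(text: str) -> str:
    return text.translate(QUOTE_MAP)

pasted = "{\u201cname\u201d: \u201c清洗\u201d}"  # JSON pasted through Word

try:
    json.loads(pasted)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

fixed = normalize_quotes(pasted)

print(parsed_raw)         # False
print(json.loads(fixed))  # {'name': '清洗'}
```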

03

OCR engines (like Tesseract) split by character — that's why scanned Chinese often comes out as "汉 字 之 间 有 空 格". The tool's "Remove inter-CJK spaces" rule directly addresses this.

04

Regex is central to Chinese text processing. The CJK Unified Ideographs block [一-鿿] (U+4E00 – U+9FFF) plus CJK Extension A [㐀-䶿] (U+3400 – U+4DBF) cover 27,000+ code points. The tool's "inter-CJK spaces" rule uses both ranges.
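
The two ranges can be turned into a character class and checked against the examples on this page. This is a sketch under assumptions: the tool's exact pattern is not published here, and matching the full-width space (U+3000) alongside the regular space is a choice made for illustration.

```python
import re

RANGES = {
    "main CJK range": (0x4E00, 0x9FFF),
    "second range": (0x3400, 0x4DBF),
}

# Number of code points covered by both ranges together.
total = sum(hi - lo + 1 for lo, hi in RANGES.values())
print(total)  # 27584

# Build "一-鿿㐀-䶿" from the numeric bounds.
cjk_class = "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in RANGES.values())
INTER_CJK_SPACE = re.compile(f"(?<=[{cjk_class}])[ \u3000]+(?=[{cjk_class}])")

def remove_inter_cjk_spaces(text: str) -> str:
    return INTER_CJK_SPACE.sub("", text)

print(remove_inter_cjk_spaces("汉 字 之 间 有 空 格"))  # 汉字之间有空格
print(remove_inter_cjk_spaces("I love 中文"))           # I love 中文
```

Because the lookarounds do not consume the neighbouring characters, every space in a run like "汉 字 之" is matched, not just every other one.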

05

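U+FEFF is the Byte Order Mark (BOM). In UTF-16 and UTF-32 it signals byte order; in UTF-8, which has no byte-order ambiguity, it serves only as an optional signature at the start of a file. Some Windows editors (older versions of Notepad, for example) write it when saving as UTF-8, and Linux/Mac tools may error on it. The tool strips it along with other zero-width characters.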
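
When the text comes from a file rather than the clipboard, the BOM can also be dropped at decode time. A short Python sketch using the standard `utf-8-sig` codec:

```python
import codecs

raw = codecs.BOM_UTF8 + "中文".encode("utf-8")

plain = raw.decode("utf-8")    # the BOM survives as a leading U+FEFF
sig = raw.decode("utf-8-sig")  # this codec drops one leading BOM

print(raw[:3].hex())                  # efbbbf
print(len(plain), len(sig))           # 3 2
print(plain.lstrip("\ufeff") == sig)  # True
```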

06

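Most emoji lie outside the Basic Multilingual Plane, so each code point takes 4 bytes in UTF-8 (and a surrogate pair in UTF-16), and sequences joined by U+200D or carrying modifiers are longer still. They cause errors or truncation in storage limited to 3-byte UTF-8, such as MySQL's legacy utf8 (utf8mb3) charset. The "Strip emoji" option bulk-removes them.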
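
The storage cost of emoji can be measured directly. In this sketch `sizes` is a hypothetical helper, and the samples are written as escapes so the code points are unambiguous.

```python
def sizes(text: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, UTF-16 code units) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")) // 2,
    )

samples = {
    "CJK ideograph": "\u4e2d",     # 中
    "BMP symbol": "\u2600",        # black sun, U+2600
    "single emoji": "\U0001F600",  # grinning face
    # man + ZWJ + woman + ZWJ + girl, rendered as one family glyph
    "ZWJ family": "\U0001F468\u200d\U0001F469\u200d\U0001F467",
}

for label, text in samples.items():
    print(label, sizes(text))
# CJK ideograph (1, 3, 1)
# BMP symbol (1, 3, 1)
# single emoji (1, 4, 2)
# ZWJ family (5, 18, 8)
```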

07

Windows, macOS, and Linux use different line endings: Windows = CRLF (\r\n), macOS/Linux = LF (\n), legacy Mac OS = CR (\r). The tool normalises everything to LF, the de facto cross-platform convention.
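
A single regular expression covers all three conventions. A minimal sketch (`normalize_line_endings` is a hypothetical name):

```python
import re

def normalize_line_endings(text: str) -> str:
    # "\r\n?" matches CRLF as one unit, or a lone CR; existing LFs are untouched.
    return re.sub(r"\r\n?", "\n", text)

mixed = "windows\r\nlegacy mac\runix\n"
print(repr(normalize_line_endings(mixed)))  # 'windows\nlegacy mac\nunix\n'
```

Matching CRLF as one unit matters: replacing CR and LF independently would turn every Windows line ending into two line breaks.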

08

"Line break" and "paragraph break" are different. A paragraph can have multiple soft line breaks (\n). A blank line (\n\n) marks paragraph separation. The "Collapse blank lines" rule preserves one blank line for paragraphs while removing excess.

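The paragraph-preserving behaviour can be sketched with one substitution. This assumes line endings are already normalised to LF and that blank lines contain no stray spaces; `collapse_blank_lines` is a hypothetical name.

```python
import re

def collapse_blank_lines(text: str) -> str:
    # Three or more newlines in a row mean two or more blank lines;
    # keep exactly one blank line (two newlines).
    return re.sub(r"\n{3,}", "\n\n", text)

sample = "第一段第一行\n第一段第二行\n\n\n\n第二段"
print(repr(collapse_blank_lines(sample)))
# '第一段第一行\n第一段第二行\n\n第二段'
```

The soft line break inside the first paragraph and the single separator between paragraphs both survive.
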
09

"Regular space" and "full-width space" (U+3000) are different characters. The "Collapse whitespace" rule handles both — merging any whitespace (regular spaces, tabs, full-width spaces, non-breaking spaces) into a single regular space.

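A sketch of the whitespace rule, assuming the four characters named above are the full set and that newlines are left for the line-oriented rules (`collapse_whitespace` is a hypothetical name):

```python
import re

# Regular space, tab, non-breaking space (U+00A0), full-width space (U+3000).
WHITESPACE_RUN = re.compile("[ \t\u00a0\u3000]+")

def collapse_whitespace(text: str) -> str:
    return WHITESPACE_RUN.sub(" ", text)

sample = "中文\u3000\u3000text\t\u00a0 end\nnext  line"
print(repr(collapse_whitespace(sample)))  # '中文 text end\nnext line'
```
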
10

Pairs with RT-CHN-037 (Simp/Trad), RT-CHN-038 (Full/Half-Width), and RT-CHN-039 (Vertical Layout) — the complete Chinese text-processing toolset.

Frequently Asked Questions

  • Fully local. All processing is in-browser; no text is uploaded. Even confidential content is 100% safe.

  • The first 5: collapse whitespace, remove inter-CJK spaces, normalize line endings, collapse blank lines, remove zero-width chars. These are side-effect-free "safe cleanup". The latter 3 (quotes, emoji, URLs) are opt-in — they may change meaning, so off by default.

  • No. The rule is strictly limited to spaces between two CJK characters — so "汉 字" becomes "汉字", but "I love 中文" is preserved.

  • Preserves one blank line as the paragraph separator and removes the excess: a run of several consecutive blank lines collapses to one. Paragraph structure survives while the redundancy goes.

  • Curly “ ” → straight "; curly ‘ ’ → straight '. Word's "smart quotes" are a common source of JSON and SQL errors, and this rule fixes them.

  • Covers major emoji ranges: U+1F300-U+1FAFF (main emoji blocks, including faces at U+1F600-U+1F64F) and U+2600-U+27BF (symbols + dingbats). A few edge-case emoji may slip through.

  • Strongly recommended. Zero-width chars, mixed spaces, and smart quotes all cause mysterious "search doesn't match" bugs. Production text input should always be sanitised this way.

  • The original is preserved in the left input box (never overwritten). You can edit, toggle rules, and compare before/after at any time. The result (right) is read-only.

  • Enable everything. PDF clipboard typically carries: zero-width chars, misplaced spaces, smart quotes, possibly emoji/URLs. The tool's "all on" config is ideal for PDF cleanup.

  • Only the first 5 "side-effect-free" rules. Leave "Normalize quotes" off (might alter string literals); leave "Strip emoji/URLs" off (may contain code-relevant content).
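
The emoji ranges quoted in the FAQ above translate into a single character class, and the same sketch shows what "slipping through" looks like in practice. `strip_emoji` is a hypothetical name, and the pattern is an illustration rather than the tool's actual regex.

```python
import re

# Ranges quoted in the FAQ. U+1F600-U+1F64F (faces) already sits inside
# U+1F300-U+1FAFF, so it needs no separate entry.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_emoji(text: str) -> str:
    return EMOJI.sub("", text)

# Sun (U+2600) and grinning face (U+1F600) are both removed.
print(strip_emoji("今天天气好\u2600\U0001F600！"))  # 今天天气好！

# Edge case 1: flags are pairs of regional indicators (U+1F1E6-U+1F1FF),
# which lie below U+1F300 and are left untouched.
flag = "\U0001F1E8\U0001F1F3"
print(strip_emoji(flag) == flag)  # True

# Edge case 2: a heart with variation selector loses U+2764 but keeps U+FE0F.
heart = "\u2764\ufe0f"
print(strip_emoji(heart) == "\ufe0f")  # True
```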
