Chinese Character Frequency (汉字使用频率)

Chinese character frequency lookup. Top 200 most-used characters with rank and cumulative coverage %. Based on Jun Da modern corpus data (public domain).

RT-CHN-033 · Converters & Units

How to use

View the Top 50

See the most frequent 50 characters at a glance — these 50 cover ~35% of modern Chinese text.

Search by character

Enter a Hanzi, pinyin, or English meaning; the tool returns the character's rank and cumulative coverage %.

Gauge text difficulty

When writing, check the rank distribution of the characters you use: the larger the share of high-frequency characters, the more readable the text.

Prioritise high-frequency learning

Beginners should master the top 200 first — fastest path to reading competence.
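The "gauge text difficulty" step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the rank table below holds only the ten most frequent characters named on this page, and the names `RANK`, `is_hanzi`, and `high_frequency_share` are hypothetical. A real check would load a full frequency list instead.

```python
# Hypothetical mini rank table: the ten most frequent characters as listed on
# this page. A real check would load the full frequency list instead.
TOP_CHARS = "的一是不了在人有我他"
RANK = {ch: i + 1 for i, ch in enumerate(TOP_CHARS)}


def is_hanzi(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def high_frequency_share(text: str, max_rank: int = 10) -> float:
    """Fraction of the Hanzi in `text` whose rank is at most `max_rank`."""
    hanzi = [ch for ch in text if is_hanzi(ch)]
    if not hanzi:
        return 0.0
    # Characters missing from the table are treated as ranked below the cutoff.
    hits = sum(1 for ch in hanzi if RANK.get(ch, max_rank + 1) <= max_rank)
    return hits / len(hanzi)


share = high_frequency_share("我是人，他不在。")
print(f"{share:.0%} of the Hanzi are in the top {len(TOP_CHARS)}")
```

Punctuation and non-Chinese text are ignored, so the share reflects only the Hanzi in the passage. A higher share suggests an easier text, under the assumption that the rank table matches the kind of text being checked.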

Character Frequency: Why the Top 1,000 Characters Matter Disproportionately

Chinese character frequency follows Zipf's Law: a small handful of characters appear extremely often, while the vast majority almost never do. Concretely, the top 100 characters cover ~42% of all text, the top 1,000 cover ~91%, and the top 3,500 cover ~99.5%. Mastering the first 3,500 lets you recognise nearly every character you meet in newspapers, novels, and official documents.
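As a rough illustration of the Zipf pattern, the sketch below estimates the share of text for a character at a given rank by dividing the top character's share by the rank. It assumes the ~4.1% share this page quotes for 的, and the function name `zipf_estimate` is hypothetical. The pure 1/N rule is an idealisation, so real corpus figures will deviate from these estimates.

```python
def zipf_estimate(rank: int, top_share: float = 0.041) -> float:
    """Idealised Zipf estimate of the share of text taken by the character at `rank`.

    `top_share` is the share of the rank-1 character (assumed here to be ~4.1%).
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_share / rank


for rank in (1, 2, 3, 10, 100):
    print(f"rank {rank:>3}: {zipf_estimate(rank):.3%}")
```

The steep drop between rank 1 and rank 100 is the point: under this idealisation, each step down the list contributes far less coverage than the one before it.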

The learning leverage this creates

For learners, this is an astonishing efficiency lever: each additional high-frequency character buys far more reading capability than a low-frequency one. A learner who masters the first 500 characters has roughly 80% of the reading capability of one who masters 3,500. This is why HSK and other curricula sequence their material largely by frequency rather than in alphabetical or topical order.

Corpora and frequency statistics

Modern Chinese character frequency data comes from corpora: large collections of contemporary text (newspapers, novels, online writing, official documents) in which every character occurrence is counted. Jun Da (笪骏), professor at Middle Tennessee State University, compiled the most widely used open-data "Modern Chinese Character Frequency List" in the 1990s, drawing on millions of characters of modern text. This tool's data is based on Jun Da's rankings.

Note: frequency rankings vary somewhat across corpora (news vs. literature vs. internet chat). This tool uses Jun Da's general-purpose modern corpus.
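The counting step behind any frequency list can be sketched as follows. This is a minimal example under stated assumptions, not Jun Da's actual pipeline: the function name `char_frequencies` is hypothetical, the sample sentence is made up, and only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are counted.

```python
from collections import Counter


def char_frequencies(text: str) -> list[tuple[str, int, float]]:
    """Return (character, count, share) tuples, most frequent first.

    Only characters in the basic CJK Unified Ideographs block are counted,
    so punctuation, digits, and Latin letters are skipped.
    """
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    return [(ch, n, n / total) for ch, n in counts.most_common()]


sample = "我的书是我的，不是他的。"
for rank, (ch, n, share) in enumerate(char_frequencies(sample), start=1):
    print(rank, ch, n, f"{share:.0%}")
```

Running the same function over a news corpus and a chat corpus would produce different rankings, which is the corpus dependence described in the note above.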

10 Facts about Chinese Character Frequency

01

"的" is the most frequent Chinese character, accounting for ~4.1% of modern text. One in every 24 characters you read is 的.

02

The top 10 characters cover roughly 14% of text: 的, 一, 是, 不, 了, 在, 人, 有, 我, 他. After learning these 10, you recognise about 1 out of every 7 characters in any text.

03

Zipf's Law holds strongly for Chinese character frequencies — the character ranked N appears roughly 1/N as often as the rank-1 character. This pattern holds across nearly all natural languages.

04

The top 1,000 characters cover 91% of modern Chinese text. The top 3,500 cover 99.5%. Mastering 3,500 characters makes a learner essentially fluent for most modern reading.

05

China's Ministry of Education mandates 3,500 common characters for primary and secondary education. This number was chosen via frequency analysis — corresponding to 99.5% text coverage.

06

The total number of Chinese characters far exceeds 3,500. The Kangxi Dictionary lists 47,035. The Hanyu Da Zidian lists roughly 55,000. Unicode CJK encodes 97,000+. But the vast majority are historical, name-specific, or technical characters rarely used in modern daily writing.

07

The same character has different frequency in different corpora. News articles → 政, 济, 府 rank higher. Classical literature → 之, 乎, 也. Online chat → 啊, 呢, 哈. This tool uses general-purpose modern corpus data.

08

Simplified vs. traditional largely doesn't affect frequency rankings, because most characters map 1:1, but stroke count differs dramatically. 龍 → 龙 (16→5 strokes), 邊 → 边 (19→5). This was the original motivation for simplification: making the most frequent characters easiest to write.

09

Recognising ≠ writing. An educated Chinese reader can recognise 4,000-6,000 characters but may only be able to write 2,500-3,500 unaided. "IME dependence" — losing handwriting fluency due to typing-only use — is a real modern phenomenon.

10

Pairs with RT-CHN-031 (Chengyu Dictionary) and RT-CHN-032 (HSK Vocabulary) — the three pillars of systematic Chinese learning.

Frequently Asked Questions

  • Professor Jun Da's "Modern Chinese Character Frequency List," based on statistical analysis of multiple modern Chinese text corpora. One of the most-cited public-domain datasets in Chinese computational linguistics.

  • The corpus determines the ranking. News corpora boost 政/济; literary corpora boost 云/叶; internet chat boosts 啊/哈. Jun Da's data is general-purpose but still reflects 1990s-era text patterns.

  • Yes — verified across multiple independent corpora. This is Zipf's Law in action, a statistical pattern that holds across nearly all natural languages.

  • Roughly yes, but not strictly. Prioritise high-frequency characters, but also consider semantic grouping (kinship terms together: 父, 母, 兄, 弟, 姐) and radical patterns (月-component: 明, 朋, 肝). HSK curricula balance frequency with cognitive efficiency.

  • Depends on use case. Not needed for daily reading. But essential for classical literature, medicine, law, personal/place names. E.g. 鼎 and 鼐 are low-frequency but appear often in historical texts.

  • Simple summation: add up the individual frequencies of the top N characters. E.g. the top 3 (的, 一, 是) have individual frequencies 4.1%, 1.7%, 1.2% → cumulative 7.0%. In other words, knowing these three characters lets you recognise about 7% of the characters in any text.

  • Mostly the same — most characters map 1:1 (简龙 ↔ 繁龍). But edge cases exist: some simplified characters merge multiple traditional ones (干 = 乾/幹/干), which can shift rankings. This tool uses modern simplified corpora.

  • Core high-frequency characters are nearly identical across regions (的, 一, 是 — all top 3 everywhere). Differences emerge in the mid-frequency range (rank 200-2000) reflecting local culture, industry, and news topics. For learning purposes, the top 1,000 are nearly universal.

  • It affects IME design, not actual character frequency. Pinyin IMEs surface high-frequency characters first; Wubi IMEs encode by radical — but the actual characters people type remain high-frequency (because daily expression needs drive frequency).

  • Currently uses Jun Da's 1990s data (last updated 2004). Core high-frequency characters have been stable for decades, but mid- and low-frequency ranges may shift. Future plan: incorporate more recent corpus data.
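The cumulative-coverage calculation described in the FAQ (summing the individual frequencies of the top N characters) can be sketched as a running total. The three shares below are the ones quoted on this page, while the names `TOP3` and `cumulative_coverage` are hypothetical.

```python
from itertools import accumulate

# Individual shares in percent for the top 3 characters, as quoted on this page.
TOP3 = [("的", 4.1), ("一", 1.7), ("是", 1.2)]


def cumulative_coverage(freqs: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Pair each character with the running total of shares up to and including it."""
    chars = [ch for ch, _ in freqs]
    totals = accumulate(share for _, share in freqs)
    # Round to one decimal place to hide floating-point noise in the sums.
    return [(ch, round(total, 1)) for ch, total in zip(chars, totals)]


for ch, coverage in cumulative_coverage(TOP3):
    print(ch, f"{coverage}%")
```

The last value in the list is the coverage for the whole set, so extending the input to the top N characters gives the coverage figure for N.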
