Improving the word count implementation


#1

I’m looking to improve the word count implementation. Goals are:

  1. It should ignore syntax characters such as the leading ‘#’ in a heading or the ‘**’ in bold inline syntax.

  2. It should work as correctly as possible with all languages (CJK in particular).

  3. It probably won’t all work correctly to start with, but the implementation should make it pretty easy to add fixes for cases that fail.

From what I can tell there are two reasonable approaches to doing word count. You can focus on what you want to be words and count those ranges, or you can focus on the things that you want to be word breaks and count everything that doesn’t match as a word. So as a simple example of those two cases:

  1. Count all non-whitespace ranges as words
  2. Or count all alphanumeric ranges as words

I think to get CJK correct you must take the approach of counting what you want to be words (or do multiple passes), since CJK text doesn’t use whitespace between words. With that in mind, this is the most appealing framework for a solution that I’ve come across:

var wordRegex = new RegExp(
  '[A-Za-z0-9_\'\u00C0-\u017F]+|' + // ASCII letters, digits, underscore, apostrophe + accented Latin
  '[\u3040-\u309F]+|' + // Hiragana
  '[\u30A0-\u30FF]+|' + // Katakana
  '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]', // Single CJK ideographs
  'g');
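To show how a regex like this turns into an actual count, here is a minimal sketch. The helper name countWords is only for illustration, and the apostrophe inside the first character class is escaped so that the string literal parses.

```javascript
// Word-defining regex: each match counts as one word.
var wordRegex = new RegExp(
  '[A-Za-z0-9_\'\u00C0-\u017F]+|' +            // ASCII letters, digits, "_", "'" + accented Latin
  '[\u3040-\u309F]+|' +                        // Hiragana runs
  '[\u30A0-\u30FF]+|' +                        // Katakana runs
  '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]', // Single CJK ideographs
  'g');

// Hypothetical helper: counts the matches of wordRegex in text.
function countWords(text) {
  // match() returns null when nothing matches, so fall back to 0.
  var matches = text.match(wordRegex);
  return matches ? matches.length : 0;
}
```

With this, countWords('# Hello **world**') gives 2, since ‘#’ and ‘**’ are not in any of the character classes and are simply skipped.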

It’s pretty simple, and generally does what I want. The problem is that I’m a bit concerned about how many Unicode ranges I’m going to have to put in there to get things right, and I don’t know where to find a simple list of all the ranges that I’ll need.

Can anyone who knows about this sort of thing make a recommendation? Should I take this brute-force approach of listing all the ranges that I want counted as words?
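One possible way to avoid hand-maintaining ranges, assuming a JavaScript engine that supports ES2018 Unicode property escapes (the 'u' flag), is to name scripts and general categories instead of code points. This is only a sketch: the names kanaRun, scriptWordRegex and countWordsByScript are made up here, and only Han, Hiragana and Katakana are treated as scripts without spaces.

```javascript
// Kana runs: Script_Extensions also covers shared characters such as the
// long-vowel mark U+30FC. The lookahead restricts the run to letters and
// marks so that shared CJK punctuation is not counted as a word.
function kanaRun(script) {
  return '(?:(?=[\\p{L}\\p{M}])\\p{Script_Extensions=' + script + '})+';
}

var scriptWordRegex = new RegExp(
  '\\p{Script=Han}|' +          // single Han ideographs
  kanaRun('Hiragana') + '|' +   // Hiragana runs
  kanaRun('Katakana') + '|' +   // Katakana runs
  // Runs of letters, digits, marks, "_" and "'" in any other script.
  '(?:(?![\\p{Script=Han}\\p{Script_Extensions=Hiragana}\\p{Script_Extensions=Katakana}])' +
  '[\\p{L}\\p{N}\\p{M}_\'])+',
  'gu');

// Hypothetical helper: counts the matches of scriptWordRegex in text.
function countWordsByScript(text) {
  var matches = text.match(scriptWordRegex);
  return matches ? matches.length : 0;
}
```

The trade-off is that any letter in any script counts as part of a word by default, so other scripts written without spaces would still need their own alternative added.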

The other approach that I can think of is to do two passes. In the first pass, match languages that don’t have word breaks, like CJK, and just replace each match with ' a '. Then make a second pass where I look for spaces and other common word break characters. This “might” make things easier since I wouldn’t need to list every language, and I assume most languages DO use spaces and such for word breaks.
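A rough sketch of that two-pass idea, for comparison. The names cjkRegex, breakRegex and countWordsTwoPass are made up, the break set is an assumed and incomplete list, and only ideographs are handled in the first pass (kana would need the same treatment or their own rule).

```javascript
// Pass 1 pattern: single CJK ideographs (same ranges as the regex above).
var cjkRegex = /[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]/g;
// Pass 2 pattern: whitespace plus a few common break characters
// (an assumed, incomplete set, including the ideographic comma and full stop).
var breakRegex = /[\s.,;:!?()\[\]{}"#*\u3001\u3002]+/;

function countWordsTwoPass(text) {
  // Pass 1: turn every ideograph into a standalone placeholder word.
  var spaced = text.replace(cjkRegex, ' a ');
  // Pass 2: split on break characters and drop the empty pieces.
  return spaced.split(breakRegex).filter(function (piece) {
    return piece.length > 0;
  }).length;
}
```

Here the list that needs maintaining is the set of break characters rather than the set of word characters.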

Anyway, any thoughts, links, etc would be helpful.

Thanks!


#2

In Chinese, people tend to count “characters” not “words.” That’s because the lack of spacing makes it difficult to count words without using a dictionary. Since most Chinese speakers are used to counting characters instead of words, it is fine to do it this way - although you should probably make it clear that this is what is being counted.
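One way to make that clear would be to keep the two counts separate instead of adding them together. A minimal sketch, reusing the ranges from the regex in the first post; the function name and the shape of the returned object are just assumptions:

```javascript
// Hypothetical helper: reports words and CJK characters as separate totals.
function countWordsAndCharacters(text) {
  // Each ideograph counts as one character.
  var ideographs = text.match(/[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]/g) || [];
  // Runs of ASCII and accented Latin letters, digits, "_" and "'" count as words.
  var words = text.match(/[A-Za-z0-9_'\u00C0-\u017F]+/g) || [];
  return { words: words.length, characters: ideographs.length };
}
```

The UI could then show something like “1 word, 2 characters” for mixed text, so the reader knows which unit each number refers to.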

I hope that helps - I don’t really understand the question the way it is phrased above.


#3

I guess the basic question is: should I concentrate on character classes that divide words (space being the simplest example) or on character classes that define words (a-z being the simplest example)?

By that definition, the above regex is concentrating on character classes that define words. The issue that I have with that is I don’t have any idea how many character classes I’m going to need to define in the end to cover “most” languages. Anyway… I’ll stick with this approach unless someone knows it’s really better to take the other approach.