Better word count method?


#1

Continuing the discussion from FoldingText 2.0 release (almost):

Any recommendations on how to do the best word count? I’d like a plain text algorithm, not one that relies on FoldingText parsing. How does this sound?

Edited

OK, on quick second thought my first example didn’t work very well. Among other things, \w doesn’t work for lots of languages. So what is the best way to do a word count? Does anyone know of any “best” method?


#2

Unless I’m missing something (quite possible), it seems like the only way to do a smarter word count (instead of just breaking on spaces) is to manually list all the possible word break characters. Here’s what I’m doing now. It basically says words are runs of anything not listed in the regex:

(text.match(/[^~`!¡@#$%^&*()_\-+={}\[\]|\\:;"'<,>.?¿\/\s]+/g) || []).length; // match() returns null when nothing matches

The issue is that there are probably a bunch of word break characters that I’ve missed, so I will likely have to keep updating this over time. Anyway, I’ll commit this now and we can go from there. It at least seems to solve the problem of counting Markdown syntax chars.
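One possible way around maintaining the break-character list by hand, as a rough sketch only and not what FoldingText ships: count runs of Unicode letters, digits, and combining marks. The name countWords is made up for illustration, and this assumes a JavaScript engine that supports Unicode property escapes (the u flag with \p{...}).

```javascript
// Sketch: count words as runs of Unicode letters/digits/marks.
// countWords is a hypothetical helper, not a FoldingText API.
function countWords(text) {
  // \p{L} = letters, \p{N} = numbers, \p{M} = combining marks.
  // An apostrophe (straight or curly) between two such runs stays
  // inside the word, so "It’s" counts once.
  const re = /[\p{L}\p{N}\p{M}]+(?:['’][\p{L}\p{N}\p{M}]+)*/gu;
  const matches = text.match(re);
  // match() returns null when there are no matches at all.
  return matches ? matches.length : 0;
}
```

Unlike the regex above, this keeps apostrophes inside words, so the two approaches will disagree on contractions. It also still depends on spaces or punctuation between words, so it won’t segment languages that are written without them.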


#3

The latest version of 2.0 now uses this updated word count method.


#4

Looks good to me. The filtered word count plugin also seems good so far. One thing I noticed, however, is that the filtered word count will not count text such as 2.0 as a word; the decimal point seems to mess it up. Since the filter is supposed to be targeting blockquotes only, perhaps numbers with a decimal place should be counted?
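For what it’s worth, here is a minimal sketch of how decimals might be handled, assuming the plugin uses something like the break-character regex posted earlier in this thread (I haven’t checked its actual source). The function name countWordsWithDecimals is made up for illustration. The idea is to try a “number with separators” pattern first and fall back to the existing pattern:

```javascript
// Sketch: treat numbers such as 2.0 or 1,000.50 as a single word.
// countWordsWithDecimals is a hypothetical name, not part of any plugin.
function countWordsWithDecimals(text) {
  // First alternative: digits joined by "." or "," count as one word.
  // Second alternative: runs of anything that is not a break character
  // (the same list as the regex posted earlier in the thread).
  const re = /\d+(?:[.,]\d+)+|[^~`!¡@#$%^&*()_\-+={}\[\]|\\:;"'<,>.?¿\/\s]+/g;
  return (text.match(re) || []).length;
}
```

With this, a date such as “June 18, 2014” is still three words, because the comma there is followed by a space rather than a digit.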

I was also thinking the filtered count should not count in-text (inline) references such as (2014 Grosjean). However, I’m probably getting way too specific for a general plugin that helps with basic endnote/footnote essay writing, rather than the in-text Harvard citation method that I use.


#5

I would like to have a word counter that ignores certain formats that can easily be changed (menu items). That way, I can exclude code segments from my word count.


#6

This plugin shows one way to do that. You’ll have to modify it because it’s filtering out quote blocks, but the modification shouldn’t be too hard.


#7

There is something wrong with all the word counters like this one. On this simple text:

Due: June 18, 2014

Teaser

Now that you are learning to use Vim, you realize that there are a lot of options that you can configure. But, you do not know how to make those options survive a reboot. I will teach you the basics of using Vim’s configuration file so that you can set up your Vim editor the way you like it. It’s time to take control!

Configure Your Vim

Conclusion

Things To Do.todo

  • Write the article
  • Proof read the article
  • Submit the article

wc gets 93 words, FT’s normal counter gets 87 words, and that plugin gets 86 words. Marked has problems too: it gets 92 words! I guess the differences lie in what each tool counts as a word. But when I write articles for publication, my editor wants accurate word counts.

I will be taking apart that plugin and trying to make what I need. Is there a way to add menu items? I want to be able to disable sections on the fly with a menu if possible.


#8

There’s no “standard” agreed-upon way to do word count that I’m aware of. Trying to do it “correctly” can get very complex. So I think most tools just cobble together some regexes that seem to work for their context. As an example:

The wc tool will report - test as two words, while FoldingText’s built-in counter reports it as one. The reason (I think) is that wc counts words by dividing text up by whitespace, while FoldingText counts ranges of text that contain word characters (abc, etc.).
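To make the difference concrete, here is a small sketch of the two approaches. These are not the actual implementations of wc or FoldingText, just stand-ins with made-up names, and the second one assumes an engine with Unicode property escapes:

```javascript
// wc-style: every run of non-whitespace characters is a word.
function countByWhitespace(text) {
  return (text.match(/\S+/g) || []).length;
}

// Word-character style: only runs of letters or digits are words,
// so a lone "-" or "#" is not counted.
function countByWordChars(text) {
  return (text.match(/[\p{L}\p{N}]+/gu) || []).length;
}

console.log(countByWhitespace("- test")); // 2
console.log(countByWordChars("- test"));  // 1
```

Any list marker, heading marker, or other standalone syntax character adds one to the whitespace-based count but nothing to the other.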

I’ve updated the word count regex that FoldingText uses recently, but I didn’t update the plugin that I linked to so that it matches. So that’s why the plugin’s count is different from the internal word count, I think.


#9

When you add a command with a description, it will get added to the Edit > Run Command… popup menu. But otherwise, no, there isn’t.

Update @raguay: As far as disabling counting for certain sections, one way to do this might be to count only visible (non-folded) lines. Then you don’t really need a separate command; you can just use fold commands to decide what is counted.
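Roughly, the idea could look like the sketch below. The node shape ({ text, folded, children }) and the function names are invented for illustration and are not FoldingText’s API; a real plugin would need to walk the editor’s own tree instead.

```javascript
// Hypothetical node shape: { text: string, folded: boolean, children: [] }.
// A folded node's own line stays visible; only its children are hidden.
function countVisibleWords(node, countWords) {
  let total = countWords(node.text);
  if (!node.folded) {
    for (const child of node.children || []) {
      total += countVisibleWords(child, countWords);
    }
  }
  return total;
}

// Stand-in counter (whitespace-separated runs); swap in any other counter.
const simpleCount = (text) => (text.match(/\S+/g) || []).length;

const doc = {
  text: "",
  folded: false,
  children: [
    { text: "Teaser", folded: true, children: [{ text: "hidden words here", folded: false, children: [] }] },
    { text: "Conclusion", folded: false, children: [{ text: "two words", folded: false, children: [] }] },
  ],
};

console.log(countVisibleWords(doc, simpleCount)); // 4
```

Here the folded “Teaser” heading still counts as one word, but nothing underneath it does.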


#10

The old version had different counts for the different plugins (plugin and builtin).


#11

That’s an idea. And if the node is folded, don’t count that node at all. I believe the copy visible script keeps the heading of the folded text. That should be easy to work from. Thanks for the idea.


#12

I made it as a script. I modified the copy unfolded script to not include the fold line, then passed the result to a TextSoap macro to strip anything that I do not count as a word, and then used word count to count the words. That was the fastest way for me to make it. Since I wanted to exclude the Teaser and Due Date blocks, I just folded them out of the way! Works great. You can get it here:

This solution used Alfred and TextSoap.