Wordcounting practices

How We Count Words at JTS

At JTS, we do not use the word-counting function in Word or any other particular word processor. Instead, we save the file as a text file and divide its size in bytes by six.

Specifically, after removing all extraneous spaces¹, we save the file as an ASCII/ANSI text file², then divide the resulting file’s byte count by six to get the word count. We have defined six characters as one word because six characters is the old printer’s rule of thumb for defining a word (five characters), plus one for the trailing space or punctuation mark (more on this later).
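The arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the tool we actually use: the function names are hypothetical, and since the procedure does not say how a fractional result is rounded, the sketch returns the unrounded quotient and leaves rounding to the caller.

```python
import os

# Six bytes per word: five characters plus one for the trailing
# space or punctuation mark, as described in the text.
BYTES_PER_WORD = 6


def word_count_from_bytes(byte_count: int) -> float:
    """Convert a byte count into a word count (unrounded)."""
    return byte_count / BYTES_PER_WORD


def word_count_from_file(path: str) -> float:
    """Word count for a file already saved as clean ASCII/ANSI text.

    Assumes extraneous spaces were removed before saving.
    """
    return word_count_from_bytes(os.path.getsize(path))
```

Under this sketch, a clean text file of 6,000 bytes counts as 1,000 words.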

To keep things simple and transparent, we use this formula consistently for everyone we work with: all translators, editors, and customers. If you ever have a question about the word count you are credited (or billed) for, ask. We’ll send you the file that was used to calculate it.

Why not use word processors?

We use this procedure, and not the word-counting function in any particular word processor, because every word processor, and even each version of the same one (e.g., Word 3.1, Word95, Word97, and Word98/2000), employs a different algorithm for counting words, with the result that each reports a different word count even for the exact same file. Saving the same file as text with different word processors, by contrast, consistently delivers the same byte count to within two to four bytes (at least, it did back when I experimented), an insignificant difference when six bytes is considered one word. And, as long as they use our procedure, anyone can get just about the same word count from any given file, regardless of the application they’re using.

How much deviation is there between JTS’s and Word’s word counts?

We have tracked the deviation between our word count and that reported by Word for several months and have been unable to establish a consistent figure: the word counts can differ by as little as 3% or by as much as 24%; the average is 8–10%. Deviation tends to be larger with files that contain a lot of numerals and special symbols, and smaller with those that are text intensive. Also, the larger the file, the smaller the difference tends to be. Other than that, we’ve been unable to establish a pattern. Our word counts, by the way, are always higher than those reported by Word (there is a threshold at which the tendency reverses with very big files, but we’ve never encountered it in practice).
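For anyone who wants to compare the two counts on their own files, the percentage can be worked out as in the sketch below. It assumes the deviation is expressed relative to Word's count, which the text does not actually specify; the function name and the figures in the usage note are hypothetical.

```python
def percent_deviation(jts_count: float, word_count: float) -> float:
    """Percentage by which the byte-based count exceeds Word's count.

    Assumption: Word's reported count is used as the base.
    """
    return 100 * (jts_count - word_count) / word_count
```

With made-up figures, a file that Word reports as 1,000 words and the byte-based formula counts as 1,080 words gives percent_deviation(1080, 1000), a deviation of 8%.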

What happens with translations delivered as files created in PowerPoint, Excel, or some other non-word-processor application?

PowerPoint and Excel have word-counting features, but they ignore text in text boxes and probably have the same inherent problems that word processors have. So, to keep things consistent, we apply our text-file-based word-counting formula here too: the content of files delivered in PowerPoint, Excel, and similar formats is dumped into text files, and the text files’ byte counts are used to arrive at a word count.

With Excel files, we copy the sheets into Word, undo the tables, and then treat them as text files. Anything in text boxes is also put into the Word file as regular text.

With PowerPoint, we use a macro to move all text content into a Word file, then use the same procedure as for a Word file: dump to text and divide the byte count by six.

Any trouble using this formula?

We’ve used the “six bytes to a word” formula since we started JTS in 1992, and it predates our use of MS-Word as our principal word processor. I (Jim Lockhart) have used this scheme since 1986, when I learned of it working for another translation company. Back then, all files were submitted as ASCII text files.

So far, we’ve had no significant complaints from associates or customers about this practice, and no one has asked us to use something different once it was explained to them. For what it’s worth, we explain the basis for our word counts to all first-time customers: We charge by the word and define a word as six bytes in a clean text file generated from the largest file in our translation and editing process.

Let me know if you have any questions or comments.

1. “Extraneous spaces” are space characters (spaces, tabs, and carriage returns) not essential to the document: doubled spaces, doubled tabs, and doubled carriage returns are all converted to a single space, tab, or carriage return, and spaces and tabs immediately before carriage returns are removed.
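A rough sketch of this clean-up, for readers who want to reproduce it outside a word processor. It is an illustration under stated assumptions rather than our actual procedure: the function name is hypothetical, "doubled" is read as any run of two or more, and line breaks are normalized to a single "\n" first so that Windows and Unix files are handled alike.

```python
import re


def remove_extraneous_spaces(text: str) -> str:
    """Collapse repeated whitespace as described in the footnote."""
    # Assumption: treat CRLF and bare CR as one line break ("\n").
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove spaces and tabs immediately before a line break.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of spaces, tabs, and line breaks to a single one.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"\t{2,}", "\t", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text
```

Note that a file saved under Windows normally stores each line break as two bytes, so the byte count should be taken from the saved file rather than from the string in memory.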

2. Under Windows XP and Word 2002 or above, we save the file as “Windows default” with conversion or as “Other: Western European (Windows)” without conversion. Saving as MS-DOS with conversion or US-ASCII usually delivers the same result, but some other settings result in slightly bloated character counts.

©1996–2017, Japan Translation Services. All rights reserved.