Counting words, characters and lines is a subject close to the heart of all freelance translators, as it's the basis for job costing and getting paid. This article considers some of the issues involved. There are tools, such as Total Assistant, to help with the process of character, word and line counting. This article explains some of the things to watch out for if you don't use such a tool to produce text counts.

Different word processors and translation tools very often produce different word count values for the same document, though typically not wildly different. The differences can be due to the use of different rules for counting as well as deficiencies in the applications used.

Many people rely on the document property statistics produced by Microsoft Word to determine the word and character counts. In many instances this is perfectly good. There are, however, a few things that Word gets wrong, as detailed later in this article.

When handling a large number of documents at a time, getting an overall word, character or line count for all documents can be time-consuming, especially if this means opening several documents in Microsoft Word, noting the count values for each file and then totalling them. There are third-party tools to automate the process of counting words. These allow a number of files, which may be of different formats, to be selected; the word and character counts are then summarised and totalled. When faced with many files, these are real time savers. For example, when working with web pages, it's quite common for a customer to supply dozens of separate files. The utilities typically support multiple file formats such as Word, HTML, PDF, PowerPoint, Excel and plain text. These dedicated word counting tools can also be more accurate, as they don't have the deficiencies that standard applications, such as Microsoft Word, have. The following table shows the word count from three different applications, including Microsoft Word, for a set of test documents.

Test Document              Microsoft Word or PowerPoint   Total Assistant   TRADOS   Feature tested
EmbeddedObjects.doc        21                             43                43       Text in autoshapes
FirstOddEvenHeaders.doc    0                              25                23       Header & footer types
GroupedShapes.doc          0                              22                0        Grouped shapes
HeadersEtc.doc             33                             70                70       Header, footnotes and end notes
MoreGroupedShapes.doc      0                              13                0        More grouped shapes
TestEmbedded.ppt           60                             80                0        Embedded OLE objects

The reasons why Microsoft Word and PowerPoint show erroneously low word counts are detailed later in this article.
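
As an illustration of the batch totalling that dedicated tools perform, here is a minimal Python sketch. It is not how Total Assistant or any other product works; it handles plain text files only, assumes UTF-8 encoding, and counts a word as anything between whitespace. The function names count_text and total_counts are invented for this example. Other formats (Word, PDF, PowerPoint) would first need their text extracted by some other means.

```python
from pathlib import Path


def count_text(text):
    """Return (words, characters, lines) for a string.

    Words are whitespace-separated tokens; characters include spaces
    and line breaks.
    """
    return len(text.split()), len(text), len(text.splitlines())


def total_counts(folder, pattern="*.txt"):
    """Count every file in `folder` matching `pattern`.

    Returns a list of (file name, words, characters, lines) rows,
    sorted by file name, and a (words, characters, lines) grand total.
    """
    rows = []
    totals = [0, 0, 0]
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts = count_text(text)
        rows.append((path.name, *counts))
        totals = [t + c for t, c in zip(totals, counts)]
    return rows, tuple(totals)
```

Calling total_counts("some_folder") returns one row per file plus the overall total, which is the summary that otherwise has to be assembled by hand.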

Scanned and electronically faxed images are another matter. These will typically be in BMP (bitmap), JPG (JPEG), GIF (Graphics Interchange Format), TIFF or some other graphical format. Acrobat PDF documents or Word documents may also contain scanned images. Text in a scanned image is not stored as encoded characters but is described like a picture, made up of coloured dots known as pixels. To allow a computer program to count the words, it is first necessary to convert the graphical image into a character-encoded format, such as Word, RTF (Rich Text Format) or plain text. This can be done with the aid of an OCR (optical character recognition) application. There are several OCR applications on the market.

Microsoft Word

Let's now consider Microsoft Word in more depth and the areas where caution is needed. Word basically counts words by assuming that everything between spaces is a word; this includes symbols such as % & @ * #. Translation tools are generally a bit smarter and will not count these symbols as words.
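
The difference between the two counting rules can be illustrated with a short Python sketch. Both functions are simplified assumptions made for this example, not the actual algorithms used by Word or by any translation tool: the first counts every whitespace-separated token, the second counts only tokens that contain at least one letter or digit.

```python
def count_whitespace_tokens(text):
    """Count everything between runs of whitespace as a word."""
    return len(text.split())


def count_tokens_with_alnum(text):
    """Count only tokens that contain at least one letter or digit,
    so stand-alone symbols such as % & @ * # are ignored."""
    return sum(
        1 for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


sample = "Sales rose 5 % in Q3 & Q4 * see note #"
print(count_whitespace_tokens(sample))   # 12
print(count_tokens_with_alnum(sample))   # 8
```

On a sentence with many stand-alone symbols the two rules differ noticeably; on ordinary prose they usually agree.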

Text from text boxes, auto-shapes, grouped shapes, headers, footers and comments is not included in the Word-generated document statistics. Headers and footers usually contain little text, so the error introduced by Word ignoring this text is minor. More significant can be the use of text boxes. Some document authors use a lot of text boxes, particularly to annotate drawings or help produce complex text layouts. In these cases, ignoring this text can produce large errors, causing the word count to be far too low.

Microsoft Word does count numbers as words. For example, 4.7 would count as one word. Some other packages may exclude numbers from the word count. Opinion seems to be split on how to treat numbers. Some say that, as numbers don't need to be translated, they should not be included; others say that, as numbers need to be transcribed and checked for errors, they should be included. For documents that contain only a few numbers, the difference is typically not significant.
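
To see how much the choice matters for a given text, the count can be made both ways. The following Python sketch is an assumption-laden illustration: the pattern NUMBER and the function count_words are invented here, and the definition of a "number" (digits with optional sign and . or , separators) is only one of several reasonable choices.

```python
import re

# Assumed definition of a number: optional sign, digits, and optional
# groups separated by "." or "," (for example 30, 4.7, 1,000.50).
NUMBER = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")


def count_words(text, include_numbers=True):
    """Count whitespace-separated tokens, optionally skipping numbers."""
    count = 0
    for token in text.split():
        # Strip surrounding punctuation before testing for a number.
        core = token.strip(".,;:!?()[]\"'")
        if not include_numbers and NUMBER.match(core):
            continue
        count += 1
    return count


sample = "The pump delivers 4.7 litres in 30 seconds."
print(count_words(sample))                          # 8
print(count_words(sample, include_numbers=False))   # 6
```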

Word does not count the text contained in any embedded objects. These objects, sometimes also known as OLE objects, are inserted into a Word document through use of the Word 'Insert' menu and the 'Object...' item. For example, an Excel worksheet can be embedded within a Word document. Often the OLE objects inserted in Word documents are diagrams with little or no text, but this is not always the case and caution is needed. For example, an embedded Excel worksheet may contain significant amounts of text.

Using Microsoft Word to open HTML files and provide statistics needs some additional care. If the HTML file contains a form with predefined options for a drop-down combo box, Word will not count the predefined option text. When the HTML contains forms, this can lead to a word count significantly lower than the true figure. Also not included in Word's statistics are the HTML page title, button text, and text in meta tags such as the description and keywords tags. In common with scanned images, text that is part of a graphic, very often a button, will also not be counted.
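
The HTML text listed above can be collected separately and added to the count by hand. The following Python sketch uses the standard library's html.parser module; the class HiddenTextCollector and the function count_hidden_words are names invented for this example, and the set of elements checked (title, option, button, input buttons, description and keywords meta tags) is an assumption based on the list above rather than a complete inventory.

```python
from html.parser import HTMLParser


class HiddenTextCollector(HTMLParser):
    """Collect translatable text that a body-only word count may miss."""

    TEXT_TAGS = ("title", "option", "button")
    META_NAMES = ("description", "keywords")
    INPUT_TYPES = ("submit", "button", "reset")

    def __init__(self):
        super().__init__()
        self.found = []       # list of (source, text) pairs
        self._inside = None   # tag name while inside a tracked element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Any new tag ends the previous tracked element; this also copes
        # with <option> elements that have no closing tag.
        self._inside = tag if tag in self.TEXT_TAGS else None
        name = (attrs.get("name") or "").lower()
        kind = (attrs.get("type") or "").lower()
        if tag == "meta" and name in self.META_NAMES:
            self.found.append(("meta " + name, attrs.get("content") or ""))
        elif tag == "input" and kind in self.INPUT_TYPES and attrs.get("value"):
            self.found.append(("button", attrs["value"]))

    def handle_endtag(self, tag):
        self._inside = None

    def handle_data(self, data):
        if self._inside and data.strip():
            self.found.append((self._inside, data.strip()))


def count_hidden_words(html):
    """Return the collected (source, text) pairs and their total word count."""
    collector = HiddenTextCollector()
    collector.feed(html)
    collector.close()
    total = sum(len(text.split()) for _, text in collector.found)
    return collector.found, total
```

Text inside graphics is not covered by this sketch; as with scanned images, it has to be counted by eye or recovered with OCR.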

Microsoft PowerPoint

In common with Word, PowerPoint does not count the text contained in any embedded objects, also known as OLE objects. Use of embedded objects in PowerPoint presentations is common. Microsoft Word tables can be easily inserted as embedded objects, using the PowerPoint Insert menu, Picture sub-menu. Excel worksheets are also commonly embedded into a slide. When embedded objects exist, they typically do contain significant amounts of text and this should be taken into account manually.

PowerPoint 97 and 2000 are not consistent with Word in the rules used for counting. For example, hyphenated words are counted as two words. PowerPoint XP corrects this difference. This means that two different users, with different PowerPoint versions, may disagree about the word count for the same presentation.
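
The effect of the hyphen rule is easy to demonstrate. This Python sketch is only an illustration of the two conventions; the function count_words and its split_hyphens parameter are invented for the example, and the exact rules applied by each PowerPoint version are not reproduced here.

```python
def count_words(text, split_hyphens=False):
    """Count whitespace-separated words.

    With split_hyphens=True each part of a hyphenated word is counted
    separately, so "user-friendly" counts as two words.
    """
    if split_hyphens:
        text = text.replace("-", " ")
    return len(text.split())


sample = "A user-friendly word-counting tool"
print(count_words(sample))                       # 4
print(count_words(sample, split_hyphens=True))   # 6
```

In marketing or technical text with many compound terms, the gap between the two conventions grows accordingly.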

PowerPoint of course doesn't provide character count statistics. For this a third party tool must be used.


Understanding the tools available and the shortcomings of different approaches to word and character counting is important. Minor word count differences are probably not worth getting hung up on and a more pragmatic approach is sensible. A few words make little difference to the overall time for translation; far more important is carefully considering the type and difficulty of the material. This of course is an altogether more skilled task!

