Word counting using a binary tree dictionary

Introduction

In "The C Programming Language", the authors present an example program that counts unique words in a text. Such a program needs to maintain a list of every unique word encountered, and for each word, an associated count of the number of occurrences.

We can design the dictionary in many different ways; a linear list, a binary tree, a balanced tree, a relational database, etc. This program implements the dictionary as a binary tree.

The Code

struct WordTree

The struct WordTree defines the layout of a single node in our dictionary tree. As the program will encounter words in a random sequence (that is, not alphabetically ordered), the data structure must be able to accommodate not only the word, but words that may occur alphabetically earlier or later than the word. The phrase "in the goodness of ghu we trust" would result in a tree that looks like

             "in"
            /    \
  "goodness"      "the"
    /    \       /     \
"ghu"     .   "of"     "we"
 / \          / \      /   \
.   .        .   .  "trust" . 




In our struct WordTree, the struct WordTree *below entity represents the left link in our tree, connecting this word to a word, encountered later in the data input, that is alphabetically earlier than the word that the node represents.

Similarly, the struct WordTree *above entity represents the right link in our tree, connecting this word to a word, encountered later in the data input, that is alphabetically later than the word that the node represents.

In struct WordTree, the char *word entity holds a pointer to a string that contains the word that this node represents. Additionally, the unsigned long used entity contains a count of the number of times that this word occurs in the text.

In this part of the code, we also define a macro (WORDCHUNK), that we use to provide the size of the chunk of memory (and the increment to that size) that we use to hold the word in.

struct WordTree {
  struct WordTree *below, /* words that come before this word */
                  *above; /* words that come after this word  */
  char *word;
  unsigned long used;
};


#define WORDCHUNK 9


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

GetWord()

The GetWord() function reads through the standard input until it either gathers a "word" together, or reaches end-of-file. For end-of-file, the function will return NULL, signalling to the caller that a "word" was not gathered. Otherwise, it returns a pointer to an malloc()'ed space in which the function has gathered the characters that make up the "word".

A "word" consists of a sequence of one or more alphabetic characters. The function first skips all leading non-alphabetics other than EOF; it will return NULL if it encounters EOF here. Since we haven't yet gathered a word together, EOF at this point means that we really have run out of data before encountering the next word, and we can safely return NULL to indicate this condition.

At this point, we know that we have encountered a "word" (as we have defined it), but we don't know if we've seen (and recorded) this word before. So, no matter what, we want to collect the word in it's entirety.

One of the things that we don't know about the word yet is it's length. We have to allocate some memory to store the word in, but we don't know how much memory will be needed. So, we use malloc() to grab a fixed amount of memory (a WORDCHUNK sized amount), with the intent of expanding this allocation if we need to. Of course, if we cannot allocate this initial amount of memory, we've got a big problem, and we return NULL to the calling function.

Once we have space for our word, we start filling in the text of it. Starting with the character that got us out of the "skip initial non-alphabetics" loop, we see if there is enough space in the allocated chunk to stash the character (and a terminating \0). If we don't have enough space, we reallocate the block to be WORDCHUNK bytes larger, and then stash the character into the now-enlarged block. Then, we get the next character, ensure that it is an alphabetic, and stash it as well.

We continue this loop until we either run out of alphabetics, or we run out of memory to store in. Running out of memory is a severe error that we cannot recover from. If we encounter this condition, we terminate the entire word-gathering process and return NULL. On the other hand, running out of alphabetics is an expected (and wanted) condition, and we handle it by terminating our copy with an end-of-string character, and ungetting the non-alphabetic character.

Now, we have a block of memory (a multiple of WORDCHUNK long) that contains an entire alphabetic string, including the requisite end-of-string character. We terminate this function by returning a pointer to this block back to the calling function.

/*
** GetWord(input): return a char pointer to the next word (as a string)
** Accepts: a FILE * for the input stream
** Returns: a char *
**      On Success: the return value will point to a nul-terminated string
**      On Failure: the return value will be set to NULL
*/ 
char *GetWord(FILE *input)
{
  char *temp = NULL;
  int byte,
      count,           /* counts chars used in current allocation */
      wordsize;        /* size of current allocation in chars     */

  /* skip leading non-alphabetics */
  for (byte = fgetc(input); !isalpha(byte); byte = fgetc(input))
  {
    if (byte == EOF)
      return NULL;
  }

  /* allocate initial word storage space */
  if ((temp = malloc(wordsize = WORDCHUNK)) == NULL)
    return NULL;

  /* start stashing bytes. stop when we run out of alphabetics */
  for (count = 0; isalpha(byte); byte = fgetc(input))
  {
    if (count == wordsize-1)
    {
      char *temp2;

      if ((temp2 = realloc(temp,wordsize+=WORDCHUNK)) == NULL)
        return NULL;
      else
        temp = temp2;
    }
    *(temp+count++) = byte;
  }

  /* nul-terminate the word, to make it a string */
  *(temp+count) = 0;

  /* put the non-alphabetic back */
  ungetc(byte,input);

  return temp;
}

AddWord()

The AddWord() function adds a given word (specified by a pointer to a '\0'-delimited string) to the tree (specified by a pointer to a pointer to the root node of the tree). The function will search the tree for a node that contains a match to the word. If it fails to find a match, the function will insert a new node containing the given word into the appropriate point in the tree. The function will return either 0 or 1.

The function first allocates a block of memory in which to build a node for the new word. If it cannot obtain such a block, the function immediately terminates, returning a value of 0 to the calling function.

Once it has a struct Wordtree-sized block of memory, the function then initializes it's fields to represent a single node with no children. This node contains a pointer to the word that the node will represent, and indicates an initial count of one usage of the word.

Next, the function will try to place this structure in the proper spot in the tree.

/*
** AddWord(text, base): add a word to the word tree
** Accepts: a char * pointing to the string to be added
**      a struct WordTree ** pointing to the struct WordTree * where
**        the address of the root node is stored.
** Returns: an int indicating the number of new words added with this invocation
**        (will be 0 or 1)
*/ 
int AddWord(char *text, struct WordTree **base)
{
  struct WordTree *temp;
  int direction;

  /* allocate space for our new word node - return 0 if no more space */
  if ((temp = malloc(sizeof *temp)) == NULL)
    return 0;

  /* fill in the word node values */
  temp->below = temp->above = NULL;
  temp->word = text;
  temp->used = 1;


  /* put the word node into the proper place in the word tree */
  if (*base == NULL)    /* no words recorded yet */
    *base = temp;   /* so make this node the root node */
  else
  {
    struct WordTree *prev = NULL,
                    *this = *base;
    int direction;

    /* find the node that should point to this node */
    while (this != NULL)
    {
      prev = this;

      if ((direction = strcmp(this->word, temp->word)) == 0)
      {
    /* word already exists in the tree - just count it, return 0 added */
        this->used += 1;
        free(temp);
        return 0;
      }
      else
        if (direction < 0)  /* new word is above this word */
          this = this->above;
        else            /* new word is below this word */
          this = this->below;
    }

    /* this == NULL, and we update the prev node */
    if (direction < 0)
      prev->above = temp;
    else
      prev->below = temp;
  }
  return 1;
}

DelWord()

The DelWord() function will recursively delete a specified struct WordTree node and any nodes linked to it.

As a sanity check, it returns without performing any action if the node to be released is NULL.

If the node is not NULL, DelWord() calls itself recursively, passing the address of the subtree on the below side to this new invocation. This causes the new invocation to release all the below nodes first.

Next, if the trace parameter is not zero, DelWord() prints both the word and used count for the current node.

DelWord() then calls itself recursively, passing the address of the subtree on the above side to this new invocation. This causes the new invocation to release all the above nodes next.

Finally, DelWord() calls the standard library free() function on the word and then on the root node, to release the memory back to the freepool.

void DelWord(struct WordTree *root, int trace)
{
  if (root == NULL) return;

  DelWord(root->below,trace);
  if (trace) printf("Word \"%s\" was used %lu times\n",root->word,root->used);
  DelWord(root->above,trace);
  free(root->word); free(root);
}

AllWords()

The AllWords() function ties the three previous functions (GetWord(), AddWord(), and DelWord()) into a single functional whole. AllWords() accepts a pointer to an open FILE, and a trace flag, and processes all the words found in the file. It uses GetWord() to retrieve single words from the FILE, AddWord() to add the word to the dictionary tree (and update a count of the number of words in the dictionary), and, once it has exhausted the FILE of all it's words, DelWord() to clean up the dictionary tree and report each word's usage count.

The function starts with defining three variables:

  1. root is a pointer to a struct WordTree, initialized to NULL,
  2. a_word is a pointer to character, and
  3. wordcount is an unsigned long integer, initially set to zero

The function performs a loop, retrieving a pointer to the next word (as returned by GetWord()) into the a_word pointer. If NULL, the loop terminates. Otherwise, the function adds the word pointed to by a_word to the dictionary tree (pointed to by root) using the AddWord() function. The return value from AddWord() (an integer indicating how many words (0 or 1) AddWord() added to the tree with this invocation) is added to the wordcount, giving us a running total of the number of unique words in the dictionary.

Once we stop looping, wordcount will contain a final total of the number of unique words in the dictionary. At this point, the function uses the DelWord() function to tear down the dictionary tree, possibly generating a report of the usage count of each individual word as it goes.

Finally, AllWords() returns the wordcount to the caller.

unsigned long AllWords(FILE *input, int trace)
{
  struct WordTree *root = NULL;
  char *a_word;
  unsigned long wordcount = 0;

  while ((a_word = GetWord(input)) != NULL)
    wordcount += AddWord(a_word, &root);

  DelWord(root,trace);

  return wordcount;
}

usage()

The usage() utility function simply prints a program usage instruction on stderr. The function takes as input parameters a pointer to the character string name of the program (as extracted and saved by main()), and a pointer to a character string containing an additional text to be written to stderr.

First, the function fprintf()'s the usage instruction, with the program name as a parameter.

Next, if the text pointer is not NULL, the function again calls fprintf() to print the additional text string to stderr

Finally, the function returns EXIT_FAILURE to the caller.

int usage(char *name, char *text)
{
  fprintf(stderr,"Usage: %s [[-v] filename]\n",name);
  if (text != NULL) fprintf(stderr,"       %s\n",text);

  return EXIT_FAILURE;
}

main()

The main() function ties all the other functions together into a coherent whole. main() parses the command parameters, and executes the dictionary processing on a specified file.

main() takes two parameters:

  1. an integer count of the number of parameters, argc, and
  2. an array of pointers to char argv, each pointer pointing to a single command parameter

The end-user may invoke this program (which all this code implements) in several ways:

  • without any command parameters at all, which will cause the program to accept it's input from stdin, and report only the number of unique words, or
  • with one command parameter, -v, which will cause the program to accepts it's input from stdin, and report counts by unique word, and the number of unique words, or
  • with one command parameter, not -v, which will cause the program to accepts it's input from a file of the same name, and report the number of unique words, or
  • with two command parameters, -v and a filename, which will cause the program to accepts it's input from the named file, and report counts by unique word, and the number of unique words

The first part of main() defines two variables:

  1. input, which is a pointer to a FILE, and
  2. trace, which is an unsigned integer initialized to zero

main() then parses it's input parameters:

If argc is 1, then there were no command parameters
so main() sets trace to 0, and input to stdin (which is also a pointer to a FILE, initialized to reference the standard input stream)
Otherwise, if argc is 2 and argv[1] points to a string "-v", then the single command parameter indicates a verbose output from stdin,
so main() sets trace to 1, and input to stdin (which is also a pointer to a FILE, initialized to reference the standard input stream)
Otherwise, if argc is 2 and argv[1] does not point to a string "-v", then the single command parameter indicates a terse output from a named file,
so main() sets trace to 0, and input to to the results of the fopen() call for the named file (if the fopen() fails to open the file, the logic calls the usage() function, and returns it's return value to the system)
Otherwise, if argc is 3 and argv[1] points to a string "-v", then the two command parameters indicates a verbose output from a named file,
so main() sets trace to 1, and input to the results of the fopen() call for the named file (if the fopen() fails to open the file, the logic calls the usage() function, and returns it's return value to the system)
Otherwise, the program was given invalid parameters,
so the function calls the usage() function, and returns it's return value to the system)

If we get past all the command argument parsing, the main() function invokes AllWords() with the input FILE pointer and the trace integer, and prints the resulting number to stdout through a fprintf() call.

Next, if input does not point at stdin, we fclose() the input file.

Finally, we return EXIT_SUCCESS to the system, indicating that the program ran to completion.

int main(int argc, char *argv[])
{
  FILE *input;
  unsigned trace = 0;       /* terse */

  /* argc  action
  **    1  read from stdin, terse report
  **    2  argv[1] == "-v"
  **         read from stdin, verbose report
  **       argv[1] != "-v"
  **         read from file named by argv[1], terse report
  **    3  argv[1] == "-v"
  **         read from file named by argv[2], verbose report
  **       argv[1] != "-v"
  **         usage error
  **   >3  usage error
  */


  switch (argc)
  {
    case 1:
      trace = 0;
      input = stdin;
      break;

    case 2:
      if (strcmp("-v",argv[1]) == 0)
      {
        trace = 1;
        input = stdin;
      }
      else
      {
        trace = 0;
        if ((input = fopen(argv[1],"r")) == NULL) 
          return usage(argv[0],"Input file not found");
      }
      break;

    case 3:
      if (strcmp("-v",argv[1]) == 0)
      {
        trace = 1;
        if ((input = fopen(argv[2],"r")) == NULL)
          return usage(argv[0],"Input file not found");
      }
      else
        return usage(argv[0],NULL);
      break;

    default:
      return usage(argv[0],NULL);
      break;
  }

  printf("There were %lu words\n",AllWords(input,trace));
  if (input != stdin) fclose(input);

  return EXIT_SUCCESS;
}

Sample Execution

The following log shows an execution of the program in it's various modes

$ # Some text for the program to play with
$ cat testwords
this is a test a test this is
not
 
$ # Input from stdin, no other options
$ words <testwords
There were 5 words
 
$ # Input from stdin, with verbose reporting
$ words -v <testwords
Word "a" was used 2 times
Word "is" was used 2 times
Word "not" was used 1 times
Word "test" was used 2 times
Word "this" was used 2 times
There were 5 words
 
$ # Input from file, no other options
$ words testwords
There were 5 words
 
$ # Input from file, with verbose reporting
$ words -v testwords
Word "a" was used 2 times
Word "is" was used 2 times
Word "not" was used 1 times
Word "test" was used 2 times
Word "this" was used 2 times
There were 5 words
 
$ # Input from file, file not found
$ words nosuchfile
Usage: words [[-v] filename]
Input file not found
 
$ # malformed command
$ words -q dfsfsdfsdf
Usage: words [[-v] filename]
$

AttachmentSize
File words.c5 KB
Page: