Write a Java program named WordCount that explores Zipf's Law as it applies to textual data by reading (scanning) text input from the standard input, counting the number of occurrences of each input word, and reporting the 100 top-ranked words to the standard output.
Your program can be based on the WordCount program given on p. 388 of the textbook, but must be revised to use the interfaces and classes of the Java Collections Framework instead of the net.datastructures classes. In particular, you should write your program to the Java Map interface, appropriately parameterized to do word counts.
You should ensure that your program is written to the Map interface by testing it with two implementations of that interface — the HashMap and TreeMap implementations from the Java Collections Framework — by changing only one line (actually, half a line) of your program. Time your program (using the Unix time command) using each implementation; which one gives faster results?
Ensure that your method for getting the word-frequency pairs into the correct ranked order by frequency operates in time O(n log n), not O(n2).
The Javadoc documentation for the following Java interface and collection classes are provided here for your convenience:
Arrays — a collection of algorithms pertaining to arrays, including sort
Collection — the generic interface for all the Java collection class implementations
Comparator — you may need to write a class that implements this interface in order to sort the word-frequency pairs according to frequency of occurrence
HashMap — a hash-table implementation of the Map interface
Map.Entry — an interface for the entries in a Map
Map — the interface for the map ADT
Set — the interface for the set ADT, which is useful in this context because the result returned by the entrySet() method on a map is a set
The full set of documentation for the Java API's is available at:
You may need to create a class that implements the parameterized Map.Entry interface, each object containing a word and its associated word count, in order to be able to instantiate an array of word-frequency entries retrieved from the map.
You may also need to implement the Comparator interface (appropriately parameterized) to provide a customized comparator that the Arrays.sort() method can use to sort the entries by frequency.
There are many sources for text input for your program. One example, which is conveniently and freely available in plain text format, is the KJV Bible, from Project Gutenberg:
The Bible, King James Version —
available both in plain text (4.24 MB) and zipped plain text (1.34 MB)
Please hand in a hardcopy listing of your solution to this assignment (to ensure that the formatting which you carefully prepared is preserved) and submit the source for your solution using the following form. You may use the same form to submit multiple files if your solution involves more than one file. You may also submit a solution multiple times if you wish to correct errors in previous submissions. Your last submissions will be accepted for grading.
Assignments will be accepted after the due date and time, but late submissions will be assessed a penalty of 1% per hour or portion of an hour.
Your assignment submissions are password-protected. These passwords apply only to form data submitted via the Web server. They are separate from (and typically different from) your network password or the password for your Unix account. To change your Web password, press the following button:
Copyright © 2005 Jonathan Mohr