The content/function word rule was motivated by sentences such as ``The tall man | was walking | down the street'' (f c c f c c f c), where a break occurs after the first noun phrase. While many verb constructions in English contain auxiliaries (function words), constructions such as the simple past do not, and in sentences such as ``The tall man walked down the street'' (f c c c c f c) this rule cannot place a break in the same position. Such cases can only be tackled by using a larger tagset.
To ensure good performance, the POS sequence models must be trained on a sufficient number of examples. There is an inevitable trade-off between robustness, achieved by having many examples of each tag sequence in the training data, and discriminative ability, achieved by allowing a large tagset that can model individual effects. While a tagset of $K=3$ ($V = \{\mathrm{function}, \mathrm{content}, \mathrm{punctuation}\}$) is too small, very large tagsets prevent accurate estimation because the number of possible tag sequences is $K^L$, i.e. it grows exponentially with the length $L$ of the POS sequence window. Finding the best possible tagset (one which genuinely optimises performance) is a difficult problem and we have not attempted a full solution. Instead we approached the problem empirically and report results from a series of experiments using various tagsets.
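To make the sparsity argument concrete, the short Python sketch below (an illustration of ours, not part of the original system) tabulates $K^L$ for a few tagset sizes and window lengths; even at $K=23$, a window of only three tags already admits 12,167 distinct sequences, each of which ideally needs several training examples.

\begin{verbatim}
# Illustration of the sparsity trade-off: the number of distinct
# POS sequences of length L over a tagset of size K is K**L.
for K in (3, 23, 40):        # candidate tagset sizes
    for L in (2, 3, 5):      # POS sequence window lengths
        print(f"K={K:2d}, L={L}: {K**L:>12,} possible sequences")
\end{verbatim}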
All the experiments here were performed on data tagged with the HMM tagger. It would have been interesting to compare the system trained on these tags with one trained on the original tags in the corpus, which were assigned by hand. Unfortunately the MARSEC corpus has a different tagset to that of the POS tagger training data, and so there was no easy way to perform a rigorous comparison without introducing artifacts from mapping one tagset to the other. However, informal experiments suggest that the automatic algorithm performs at least as well on automatically tagged data as on hand-tagged data.
Figure 2 shows the results of an experiment measuring how performance varies as a function of tagset size. New tagsets were formed by using a greedy algorithm to collapse categories in the original tagset: at each stage we found which merge of two current clusters gave the best performance and adopted the resulting tagset for the next stage. We cannot claim that the score for each tagset is the best possible for a tagset of that size, as there are many other ways to form groups from the original set. However, from evidence from this and other tagset-reduction experiments, in which the clustering was more linguistically directed, we believe this experiment shows the general trend of performance against tagset size. Again, no single measure can be taken as the sole indicator of performance, but the best results seem to occur when the tagset size is somewhere between 15 and 25. From a more detailed analysis involving all the types of phrase model, we found that a tagset of size 23 gave the best overall performance. This tagset is also the easiest to describe linguistically: the distinctions between subtypes of the four major categories (nouns, verbs, adjectives and adverbs) were ignored, combining the four basic noun tags into a single category and likewise for the six verb, three adjective and four adverb tags. All punctuation was grouped as one tag. This tagset of 23 was used in most of the subsequent experiments.
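The greedy reduction procedure can be summarised in a few lines. The Python sketch below is our reconstruction of it; the evaluate callable is a hypothetical stand-in for the expensive step of retraining the POS sequence models on the collapsed tagset and scoring break placement on held-out data.

\begin{verbatim}
from itertools import combinations

def greedy_reduce(tags, target_size, evaluate):
    """Greedily collapse a tagset: at each stage, merge the pair
    of current clusters whose union scores best under evaluate()."""
    clusters = [frozenset([t]) for t in tags]
    while len(clusters) > target_size:
        def merged(a, b):
            # candidate tagset with clusters a and b combined
            return [c for c in clusters if c not in (a, b)] + [a | b]
        # try every pairwise merge and keep the best-scoring one
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: evaluate(merged(*pair)))
        clusters = merged(a, b)
    return clusters
\end{verbatim}

Note that each stage requires evaluating every pairwise merge, i.e. ${n \choose 2}$ full retrain-and-score runs for $n$ remaining clusters, and the search never revisits an earlier merge, which is why no claim of global optimality is made.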