Extended truncated Inverse Gaussian–Poisson model

Abstract

The inverse Gaussian–Poisson mixture model is very useful for modelling highly skewed non-negative integer data in fields as diverse as linguistics, ecology, market research, bibliometry, engineering and insurance. When this model is applied to word or species frequency data, one typically truncates its sample space at zero to account for the unknown number of words or species that are not observed. In this paper, we show that truncating the sample space of the inverse Gaussian–Poisson model allows its parameter space to be extended, which improves the fit when the frequency of one is larger and the right tail is heavier than the unextended model allows. Fitting the extended model to word frequency count data, we find many instances where the maximum likelihood estimates fall in the extension of the parameter space.
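As a rough illustration of the zero truncation described in the abstract, the sketch below computes the probabilities of the ordinary (unextended) inverse Gaussian–Poisson model and renormalises them over the positive integers. The parameterisation by a mean `mu` and a dispersion `beta` (variance `mu * (1 + beta)`), the function names, and the cut-off `n_max` are assumptions made for this example and need not match the notation of the paper; the extended parameter space itself is not reproduced here.

```python
import math


def pig_pmf(mu, beta, n_max):
    """Probabilities p_0..p_n_max of a Poisson-inverse Gaussian count.

    Assumed parameterisation: mean mu > 0, variance mu * (1 + beta), beta > 0.
    Uses the standard two-term recurrence for this distribution.
    """
    root = math.sqrt(1.0 + 2.0 * beta)
    # p_0 is the Laplace transform of the inverse Gaussian mixing law at 1.
    p = [math.exp(-(mu / beta) * (root - 1.0))]
    if n_max >= 1:
        p.append(mu / root * p[0])
    for n in range(2, n_max + 1):
        p.append(
            (2.0 * beta / (1.0 + 2.0 * beta)) * (1.0 - 1.5 / n) * p[n - 1]
            + mu * mu / ((1.0 + 2.0 * beta) * n * (n - 1)) * p[n - 2]
        )
    return p


def zero_truncated(p):
    """Condition on a count of at least one: drop p_0 and renormalise."""
    scale = 1.0 - p[0]
    return [pk / scale for pk in p[1:]]


# Illustrative values only (not taken from the paper).
full = pig_pmf(mu=2.0, beta=1.0, n_max=400)
truncated = zero_truncated(full)  # truncated[0] is the probability of a count of 1
```

Working with the renormalised probabilities reflects the fact that words or species with zero observed occurrences never enter the data, so only the distribution conditional on a positive count can be fitted.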

Keywords

Distribution of vocabulary; Poisson mixture; Sichel model; species frequency; stylometry; textual data


