Abstract
Abstract
Analysis of large data sets is currently a major challenge. Strong efforts are being undertaken to tackle this problem by developing new methods or modifying existing ones. The Z association method is a new method for describing directional association in contingency tables. It allows to arbitrarily group categories for each of the two variables, for which the contingency table is analyzed. The Z coefficient was calculated on a sample data set with gene mutations in different cancer types. Results showed some association with both gene mutations and annotation groups. Detailed results obtained for particular cancer types versus particular genes and annotation groups were in line with well-known facts in cancer genomics. The “MEUSassociation” R library allows to analyze the directional association between two categorical variables, and the mutual relationship is summarized in a contingency table, by means of the Z association coefficient. The method implemented in the library allows to compute the standard Z coefficient and to apply it in a case, where all possible singular coefficients Z(A:B) are computed at the same time, giving information of association between particular rows and columns. Investigating the ranked list of the highest singular coefficients allows to reduce the complexity of a large-scale data set. Both the Z coefficient and its R implementation are important tools in categorical data analysis.
1. Introduction
M
Apart from statistical tests, there are also a number of measures of association calculated for contingency tables. These include the association coefficient C, the phi coefficient, or its extension the Cramer's V coefficient (Cramér, 1946), all being symmetric and based on the chi-squared statistics. Another example is the Goodman–Kruskal's lambda (Goodman and Kruskal, 1979) coefficient measuring the proportional reduction of error rate, which is also asymmetric where one needs to distinguish between independent and dependent variables. There is also another group of rank correlation coefficients applicable for ordinal variables, which include Spearman's rho, Goodman and Kruskal's gamma, Kendall's tau statistics, and Somers' d (Kendall, 1938; Somers, 1962; Goodman and Kruskal, 1979).
The Z coefficient described in this article belongs to the category of association coefficients. It has a purely probabilistic definition and is an asymmetric measure of association. It also coincides with Cramer's V coefficient in the case of n × 2 tables. The Z coefficient was successfully applied to large a data set analysis to determine connections between the structure of proteins and their biological function (Meus et al., 2006). These results appeared to be aligned with the entropy-based method (Brylinski et al., 2005). The Z association measurement was also used for comparative analysis of tandemly repeated trinucleotides in the human genome (Piwowar et al., 2006).
In this article, the Z coefficient was used to determine the association between different cancer types and different types of mutations in two ways: using a previously prepared and analyzed data set (Kandoth et al., 2013) and using an original data set with the inclusion of additional information of processes in which genes are taking part.
2. Methods
2.1. Data set
A sample data set was taken from Kandoth et al. (2013) and consisted of information related to mutated genes (with point mutations and small insertions/deletions) from 3281 tumors across 12 cancer types. Analyzes were performed on two sets:
12 cancer types and genes; 12 cancer types and annotated gene groups.
Twelve cancer types: breast adenocarcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), uterine corpus endometrial carcinoma (UCEC), glioblastoma multiforme (GBM), head and neck squamous cell carcinoma (HNSC), colon and rectal carcinoma (COAD, READ), bladder urothelial carcinoma (BLCA), kidney renal clear cell carcinoma (KIRC), ovarian serous carcinoma (OV), and acute myeloid leukemia (LAML; conventionally called AML).
Annotated group of genes (cellular processes in which groups of mutated genes are involved): transcription factors/regulators, histone modifiers, genome integrity, receptor tyrosine kinase signaling, cell cycle, mitogen-activated protein kinase (MAPK) signaling, phosphatidylinositol-3-OH kinase (PI(3)K) signaling, Wnt/β-catenin signaling, histones, ubiquitin-mediated proteolysis, splicing, and other.
2.2. Z coefficient methodology
Given two events A and B, one could ask how much the knowledge of B helps to determine the occurrence of A. One way to measure it is to look at the ratio of error rates: one with the knowledge of B and the other without it:
The above definition (1) can be interpreted as a value of one less than the averaged product of the family of error rate ratios calculated either with or without the knowledge of B. This definition can also be seen as a generalization of the Pearson correlation coefficient for two categorical variables in the following way. Having two numeric, binary variables X and Y, one can calculate the Pearson correlation coefficient. Its value is equal to the Z association coefficient calculated for the contingency table summarizing mutual relationship between X and Y.
Having two generic partitions A1, A2, …, Ak and B1, B2, …, Bn, one can define k × n contingency table 
The above-defined Z association coefficient (2) has the following characteristics:
ranges between 0 and 1; is equal to 0 in the case of independent variables (when entries in the contingency table are determined by marginal counts, or more precisely is equal to 1 in the case of maximal dependency, that is, where each Bj determines only one possible value for some Ai (in each column there is only one positive entry); is not symmetric (switching roles of partitions A1, A2, …, Ak and B1, B2, …, Bn, except the 2 × 2 tables case, generally leading to different results); is not monotonic (grouping particular columns or rows can both increase and decrease the value of the coefficient).
The Z association coefficient is also equal to Cramer's V coefficient (Cramér, 1946) in the case of n × 2 tables.
2.3. MEUSassociation R library
The MEUSassociation R package implements the above Z association coefficient. In the latest version 0.4, it provides the following functionality:
The package also provides the following example data:
-> Reactome (https://reactome.org) -> KEGG (http://www.genome.jp/kegg)
The MEUSassociation package is freely available on GitHub. The library, installation instructions, full documentation, and test data sets are available at https://github.com/mpiwowar/MEUSassociation.git. “MEUSassociation” runs under R, and does not require any additional libraries.
Executing the following short code allows one to get the results presented in the article:
library(MEUSassociation) data(“cancer_mutations”) z_coefficient(cancer_mutations) data(“cancer_mutations_gene_groups”) z_coefficient(cancer_mutations, row_groups = cancer_mutations_gene_groups) head(z_coefficient_ranks(cancer_mutations)) head(z_coefficient_ranks(cancer_mutations, col_groups = cancer_mutations_gene_groups))
3. Results
The Z coefficient method was used for a data set summarizing 12 different cancer types and mutated genes (Kandoth et al., 2013). The analysis was also repeated with genes grouped according to biochemical processes they are involved in. All calculations were made using the MEUSassociation R library.
The resulting Z association coefficient value of 0.34 suggests that there is some association between cancer types and gene mutations. When taking into account different gene annotation groups, the calculated Z association coefficient was equal to 0.18, suggesting that there is also some association between cancer types and biochemical processes, in which particular genes are active (Table 1). It should be noted that one should be careful when comparing Z coefficient values, especially for different table sizes, as the distribution of this coefficient is not well understood yet and it might tend to have higher or lower values depending on the table size.
Z coefficient Values Calculated for Different Cancer Types and Genes, and for Different Cancer Types Versus Gene Annotation Groups
The analysis was further extended by investigating the association between each particular cancer type and gene (Fig. 1).

The Z coefficient indicating the strength of association between cancer types and genes depicted on the map with color gradation from white (minimum value) to red (maximum value).
The above shows that when looking at cancer types and particular genes, the strongest association exists between KIRC and von Hippel–Lindau tumor suppressor with the Z coefficient value of 0.47 (Table 2).
Highest Z Coefficients Calculated for Particular Genes and Cancer Types
AML, acute myeloid leukemia; COAD.READ, colon and rectal carcinoma; KIRC, kidney renal clear cell carcinoma.
A similar extended analysis was done in the case of association between cancer types and gene annotation groups (Fig. 2).

The Z coefficient indicating the strength of association between cancer types and gene annotation groups depicted on the map with color gradation from white (minimum value) to red (maximum value).
The analysis shows the strongest (compared with other results) association between a combined group of the colon (COAD) and the rectal (READ) tumors (COAD.READ) and Wnt beta-catenin signaling pathway (Table 3).
Highest Z Coefficients Calculated for Particular Gene Annotation Groups and Cancer Types
OV, ovarian serious carcinoma; UCEC, uterine corpus endometrial carcinoma.
Literature provides strong evidence that the Wnt beta-catenin signaling pathway is very important in the READ cancer mechanism (Jung et al., 2015; Kramer et al., 2017).
4. Conclusion
Recent technological advances in molecular biology and other fields have given rise to numerous large-scale data sets. Analysis of such data sets imposes serious methodological challenges due to the usual large size and complex structure. The Z association coefficient is a tool giving valuable insight into analysis of such data sets.
“MEUSassociation” R library implements the Z association coefficient and allows to calculate it while grouping categories for each of two variables in an arbitrary way. In addition, the library allows for calculating the Z coefficient for contingency tables, evaluating the association between each particular column and row (or groups of columns and rows) while taking into account observations from the whole contingency table. These results can also be presented as a ranked list, allowing to determine row/column pairs with the highest association. It allows to reduce the complexity of high-volume data and to concentrate on the specific aspect.
Footnotes
Author Disclosure Statement
The authors declare that no competing financial interests exist.
