Association between nominal variables – why can they be friends?
The relationship between two nominal variables is usually studied with the chi-square test, one of the best-known and most widely used analyses in any scientific discipline. The test rests on several assumptions, and one of the most important concerns the expected joint frequencies: no more than 20% of the cells should have an expected frequency below 5 cases. When this assumption is not met, the test loses power. When analyzing associations with this statistic, a few nuances must be kept in mind:
- If the table is a 2×2 table, Yates’ correction for continuity should be used.
- If the sample size is very small –less than 30 cases– and a 2×2 table is available, Fisher’s exact test should be used.
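As an illustration, here is a minimal sketch of these checks in Python with SciPy; the 2×2 table is made up for the example:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

observed = np.array([[12, 8],
                     [5, 15]])  # hypothetical 2x2 contingency table

# Chi-square test; correction=True applies Yates' continuity correction (2x2 tables)
chi2, p, dof, expected = chi2_contingency(observed, correction=True)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")

# Check the expected-frequency assumption: no more than 20% of cells below 5
low = (expected < 5).mean()
print(f"{low:.0%} of cells have expected frequency < 5")

# With a very small sample (n < 30) and a 2x2 table, prefer Fisher's exact test
odds_ratio, p_exact = fisher_exact(observed)
print(f"Fisher's exact test: p = {p_exact:.3f}")
```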
Since the chi-square statistic is not bounded (its value ranges from 0 to infinity), it cannot be used to quantify the strength of the association, although residual analyses allow us to study how cases concentrate in the cells of the contingency table.
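A minimal sketch of one such residual analysis; Pearson residuals are used here, though adjusted standardized residuals are a common refinement:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[12, 8],
                     [5, 15]])  # same hypothetical table as above
expected = chi2_contingency(observed, correction=True)[3]

# Cells with |residual| greater than ~2 hold more (or fewer) cases than
# expected under independence
pearson_residuals = (observed - expected) / np.sqrt(expected)
print(pearson_residuals)
```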
There are other coefficients that make it possible to study the strength of the association between two nominal variables:
Name | Use | Range of values | Interpretation | Precautions |
--- | --- | --- | --- | --- |
Phi coefficient | 2×2 tables | Between -1 and 1 | As it approaches 1, the data concentrate on the main diagonal of the contingency table; as it approaches -1, they concentrate on the secondary diagonal. | – |
Contingency coefficient | Tables of any size | Between 0 and 1 | As it approaches 1, the relationship is more intense. | Its maximum value depends on the size of the table and can be estimated (Cmax). To compare tables of different dimensions, Pawlik’s correction, which ranges from 0 to 1, can be used. |
Cramer’s V coefficient | Tables of any size | Between 0 and 1 | As it approaches 1, the relationship is more intense. | – |
Tschuprow’s T coefficient | Tables of any size | Between 0 and 1 | As it approaches 1, the relationship is more intense. | – |
Goodman-Kruskal lambda | Tables of any size | Between 0 and 1 | As it approaches 1, the relationship is more intense. | It requires knowing which variable is independent and which is dependent. |
Yule’s Q coefficient | 2×2 tables | Between -1 and 1 | As it approaches 1, the odds ratio is greater than 1; as it approaches -1, the odds ratio is less than 1. | It is a function of the odds ratio, so it is only defined for 2×2 tables. |
Uncertainty coefficient (Theil’s U) | Tables of any size | Between 0 and 1 | Reflects the proportional reduction in error when the values of one variable are used to predict the values of the other. | It requires knowing which variable is independent and which is dependent. |
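Several of these measures are available in SciPy; a short sketch, reusing the hypothetical table from above:

```python
import numpy as np
from scipy.stats.contingency import association

observed = np.array([[12, 8],
                     [5, 15]])  # hypothetical 2x2 table

# "pearson" is the contingency coefficient; for a 2x2 table, Cramer's V
# coincides with the absolute value of the phi coefficient
for method in ("cramer", "tschuprow", "pearson"):
    print(method, round(association(observed, method=method), 3))
```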
If nominal variables change over time… follow them closely!
If you have a nominal variable measured at several points in time and you want to study whether the distribution of its modalities changes, a series of techniques allows you to test whether those variations are significant:
Name | Use | Precautions | Interpretation |
--- | --- | --- | --- |
McNemar’s test | One variable with two modalities (assessed at two points in time) | These tests have a series of assumptions that must be checked in order not to lose statistical power. | If the test is significant, there are changes over time in the frequency distribution of the variable being analyzed. |
Bowker’s test | One variable with three or more modalities (assessed at two points in time) | Same as above. | Same as above. |
Cochran’s Q test | One variable with two modalities (assessed at three or more points in time) | Same as above. If the test is significant, multiple comparisons should be made, adjusting the significance level. | Same as above. |
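McNemar’s test and Cochran’s Q are implemented in statsmodels; a minimal sketch with made-up data:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

# McNemar's test: paired 2x2 table (the same patients assessed twice)
paired = np.array([[20, 5],
                   [15, 10]])
print(mcnemar(paired, exact=True))  # exact binomial version, safer for small samples

# Cochran's Q: one binary variable assessed at three points in time
# (rows = subjects, columns = time points, values 0/1)
rng = np.random.default_rng(0)
measurements = rng.integers(0, 2, size=(30, 3))
print(cochrans_q(measurements))
```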
A word about odds ratios and the Relative Risk Ratio
In order to understand the concept of odds ratio, it is necessary to understand the following concepts associated with it:
- Probability: it is a measure of how likely it is that a phenomenon or event will occur. This value ranges from 0 to 1, with 0 meaning that the event cannot happen and 1 meaning that the event is certain to occur. For example, if the probability of improvement is 0.60, the probability of no improvement would be 0.40.
- Odds: it is the probability of an event happening divided by the probability of it not happening.
Odds range from 0 to infinity and can be calculated for the occurrence of the event as well as for the non-occurrence of the event. In the example, there would be two odds: the odds of improvement would be 0.60/0.40=1.5 and the odds of no improvement would be 0.40/0.60=0.667. These are interpreted as ratios, i.e., the number of times something can happen over the number of times it cannot happen. In this case, the patient is more likely to improve.
Odds ratios (OR) are the quotient between the two odds and also range from 0 to infinity, but how is an OR interpreted in a basic way?
- When the OR is 1, it indicates there is no association between the variables.
- Values below 1 indicate a negative association between the variables and values greater than 1 indicate a positive association between the variables.
- The further the OR is from 1, the stronger the relationship.
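A toy calculation may help; the 2×2 table below (exposure by outcome) is invented for illustration:

```python
# Hypothetical 2x2 table: rows = exposed / not exposed,
# columns = improved / not improved
a, b = 30, 10   # exposed:     improved, not improved
c, d = 20, 40   # not exposed: improved, not improved

odds_exposed = a / b                 # 3.0
odds_unexposed = c / d               # 0.5
OR = odds_exposed / odds_unexposed   # equivalently (a * d) / (b * c)
print(f"OR = {OR:.2f}")  # 6.00 -> positive association between exposure and improvement
```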
In the case of the Relative Risk Ratio (RR), it is defined as the quotient of the probabilities –i.e., the risks– of having the disease or presenting the outcome of interest when the predictive factor –i.e., the risk factor– is present versus when it is absent.
To interpret the relative risk, consider an RR of 2: the risk in the exposed group is twice as high as in the other group. An RR equal to 1 means the risk is the same in both groups, and an RR below 1 means the risk is lower in the exposed group than in the other.
The main difference with the OR is that the Relative Risk Ratio requires incidence data, so it is used primarily in the assessment of prospective studies, whereas the OR can be estimated even when incidence is unavailable and is therefore used primarily in the analysis of retrospective studies.
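For the same invented table as above, the relative risk compares probabilities rather than odds:

```python
# Same hypothetical 2x2 table as in the odds ratio sketch
a, b = 30, 10   # exposed:     improved, not improved
c, d = 20, 40   # not exposed: improved, not improved

risk_exposed = a / (a + b)      # P(improvement | exposed)   = 0.75
risk_unexposed = c / (c + d)    # P(improvement | unexposed) ~ 0.33
RR = risk_exposed / risk_unexposed
print(f"RR = {RR:.2f}")  # 2.25 -> improvement is 2.25 times as likely when exposed
```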
Can anything else be done with nominal variables?
Of course! The analysis of nominal variables can go beyond studying their descriptive behavior and their bivariate relationship. Some examples of possible analyses are:
- Log-linear models: if you want to analyze the relationship between three or more nominal variables.
- Classification machine learning algorithms: if you want, for example, to predict the probability of a patient recovering (or not) based on a series of clinical variables, these algorithms are very useful (see the sketch after this list).
- Text analysis and open-ended responses: analyses linked to Natural Language Processing are used for text analysis, for example, of open-ended responses given during an interview with patients to determine their health status.
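As a taste of the second bullet, here is a minimal classification sketch with scikit-learn; the “clinical” data are entirely synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three clinical variables and a binary recovery outcome
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", round(model.score(X_test, y_test), 2))
print("P(recovery) for the first test patient:",
      round(model.predict_proba(X_test[:1])[0, 1], 2))
```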