Information management
Cluster analysis of households characterized by categorical indicators
Name and surname of author:
Hana Řezanková, Tomáš Löster
Keywords:
cluster analysis, number of clusters, qualitative variables
DOI (& full text):
Annotation:
In this paper we deal with the evaluation of the results of cluster analysis applied to data files in which objects are characterized by qualitative variables. We describe the methods of clustering, of determining the optimal number of clusters, and of evaluating the obtained clusters that are implemented in the two-step cluster analysis procedure of the SPSS statistical software package. These techniques are applied to selected household indicators gathered in the SILC (Statistics on Income and Living Conditions) survey in the Czech Republic in 2008.
As an example, we clustered households characterized by indicators expressing whether a household owns a computer and a car. We discuss the problem of determining the optimal number of clusters with an approach based on information criteria (we use the Bayesian information criterion), and we also determine the number of clusters by means of the silhouette coefficient. We then describe the four obtained clusters on the basis of indicators of working activity, level of education and degree of urbanization. Moreover, we extended the characterizing variables with a recoded indicator expressing how well the household manages with its income. On the basis of this example we illustrate the investigation of variable importance. In this case we describe the three obtained clusters in terms of the three variables used in the analysis.
In conclusion we mention some other approaches to evaluating the clustering of objects characterized by categorical variables. They consist of coefficients based on multivariate analysis of variance that use specialized variability measures for nominal and ordinal data, and of modifications of some other coefficients for qualitative data. The problem of mixed-type variables is also mentioned.
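To make the role of the silhouette coefficient more concrete, the following minimal sketch computes the classical average silhouette width for objects described by categorical indicators. It is an illustration under stated assumptions, not the procedure used in the paper: the simple matching dissimilarity is swapped in for the distance used by two-step clustering, the result is not necessarily the silhouette measure as SPSS reports it, and the function names and the toy data are hypothetical.

```python
def simple_matching(a, b):
    """Share of variables on which two objects differ (0 = identical).

    Assumption: this dissimilarity is chosen for illustration only.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def silhouette(data, labels):
    """Average silhouette width over all objects (classical definition)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    widths = []
    for i, obj in enumerate(data):
        # Accumulate dissimilarities from object i to every cluster.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, other in enumerate(data):
            if j != i:
                sums[labels[j]] += simple_matching(obj, other)
                counts[labels[j]] += 1

        own = labels[i]
        if counts[own] == 0:
            # Convention: a singleton cluster gets silhouette width 0.
            widths.append(0.0)
            continue

        a = sums[own] / counts[own]  # mean dissimilarity within own cluster
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # nearest other cluster
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)

    return sum(widths) / len(widths)


# Hypothetical toy data: (owns computer, owns car) coded as 1 = yes, 0 = no.
households = [(1, 1), (1, 1), (0, 0), (0, 0)]
good = silhouette(households, [0, 0, 1, 1])  # identical households grouped together
bad = silhouette(households, [0, 1, 0, 1])   # identical households split apart
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate objects that sit closer to another cluster than to their own. Comparing the average across candidate numbers of clusters is one way to choose among them.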
Section:
Information management
Appendix (online electronic version):