Global Journal of Human-Social Science, B: Geography, Environmental Science and Disaster Management, Volume 22 Issue 3

In data science, cluster analysis is a branch of unsupervised machine learning that encompasses a set of tools for forming clusters with homogeneous properties from a large number of heterogeneous samples. According to VanderPlas (2016), data science is difficult to define, but it can be regarded as an interdisciplinary set of skills that is becoming increasingly important in many applications in industry and academia. It lies at the intersection of three distinct, overlapping areas: 1 – Math & Statistics Knowledge, to model and summarize data sets; 2 – Hacking Skills, to design and use algorithms to store, process and visualize these data efficiently; and 3 – Substantive Expertise, the knowledge necessary to interpret the results. Data science is commonly applied with programming languages such as R and Python (among others), using code and sets of functions (libraries) that make it possible to manipulate and process data and to generate relevant information about the data in seconds. Python in particular has established itself as one of the most popular languages for scientific computing owing to its interactive nature and its maturing ecosystem of scientific libraries, which makes it an attractive choice for algorithm development and exploratory data analysis (Millman and Aivazis, 2011). Python offers many libraries, each with modules that serve different purposes at different stages of a data science analysis; which ones are used depends on the analyst's objective. For further detail on the main libraries and modules, the following references are recommended (among others): Millman and Aivazis (2011); Pedregosa et al. (2011); Harris et al. (2020); Virtanen et al. (2020); and the book "Data Science from Scratch" by Joel Grus (2016), which present concepts and details on the subject.
There are several clustering methods, as shown by Hastie, Tibshirani and Friedman (2009), Hair et al. (2009), Härdle and Simar (2015) and Forsyth (2018), among others. The hierarchical method is used in this article because it is the one most frequently applied in practice. Härdle and Simar (2015) indicate that it starts with the finest possible partition (each observation forms its own cluster), calculates the distance matrix for the clusters, and joins the two clusters that have the shortest distance, repeating these steps until all observations have been merged. However, it should be emphasized that clustering techniques in general, and hierarchical clustering in particular, are exploratory data analyses, and different combinations of settings may reveal different characteristics of the data set, as analyzed by Chen et al. (2007). Härdle and Simar (2015) state that cluster analysis can be divided into two fundamental steps: 1 – Choice of a proximity measure (each pair of observations is checked for the similarity of their values); and 2 – Choice of the cluster creation algorithm (based on the proximity measures, objects are assigned to clusters so that the differences between clusters become large and the observations within a cluster become as close as possible). The proximity between observations is described by a similarity or distance matrix whose components give the similarity coefficient or the distance between two points. There is a variety of distance measures for the various types of data; for quantitative variables, the most widely used are the Euclidean (used in this article), generalized/weighted, and Minkowski distances. There is also a variety of cluster linkage methods, the main ones being listed in Table 1. Hair et al. (2009) state that, combined with the chosen similarity measure, the clustering algorithm provides the means to represent the similarity between clusters with multiple members.
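As a minimal illustration of the first step (the choice of proximity measure), the sketch below computes Euclidean and Minkowski distances and assembles them into a distance matrix using only the Python standard library. The function names and the three sample observations are hypothetical and are not taken from the data set analyzed in this article.

```python
import math


def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """Euclidean distance, i.e. the Minkowski distance with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(samples, dist=euclidean):
    """Symmetric matrix whose entry [i][j] is the distance between samples i and j."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(samples[i], samples[j])
    return d


# Hypothetical observations, each described by two quantitative variables
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(samples)

for row in D:
    print(row)
```

In practice the variables are often standardized before the distances are computed, so that no single variable dominates the result; that step is omitted here for brevity.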
However, according to Härdle and Simar (2015), Metz (2006) and Frank and Todeschini (1994), there is no "correct" combination of distance measure and linkage method.

Table 1: Linkage methods in cluster analysis

Linkage method | Comment | Clustering shapes
Single linkage | Defines the distance between two groups as the shortest distance between an element of one group and an element of the other group; also called the Nearest Neighbour algorithm. | Tends to produce large clumps, weakly linked and with little internal cohesion.
Complete linkage | The distance between two clusters is calculated as the greatest distance between two objects in opposite clusters; also called the Farthest Neighbour algorithm. | Tends to produce small, well-separated clumps.
Average linkage | The distance between two clusters is calculated as the average of the distances between all pairs of objects in opposite clusters. | Offers a compromise between the two previous algorithms.

Clustering of Fine-Grained Tropical Soils using Data Science Tools Applied to their Geotechnical Properties. Global Journal of Human-Social Science (B), Volume XXII, Issue III, Version I, 2022. © 2022 Global Journals.
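The single, complete and average linkage methods can be compared on the same distance matrix with SciPy (Virtanen et al., 2020). The following is only a sketch under stated assumptions: the six observations are hypothetical, and the cut into two clusters is chosen for illustration rather than taken from the analysis in this article.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: two well-separated groups of three observations each
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# Condensed vector of pairwise Euclidean distances
distances = pdist(X, metric="euclidean")

labels = {}
heights = {}
for method in ("single", "complete", "average"):
    # Agglomerative merge history: one row per merge, column 2 holds the merge distance
    Z = linkage(distances, method=method)
    # Cut the hierarchy so that at most two clusters remain
    labels[method] = fcluster(Z, t=2, criterion="maxclust")
    # Distance at which the last two clusters are joined
    heights[method] = float(Z[-1, 2])
    print(method, labels[method], round(heights[method], 3))
```

On data this clearly separated, all three methods recover the same two groups and differ only in the distance at which the final merge occurs, which for average linkage lies between the single and complete values. On less clearly structured data the resulting partitions themselves can differ, which is why no single combination of distance measure and linkage method can be regarded as correct.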