Proximity Matrix in Random Forest

0
32 In mathematics or statistics, a proximity matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise between the elements of a matrix. Broadly defined; a proximity matrix measures the similarity or dissimilarity between the pairs of matrix.

If measured for all pairs of objects in a matrix e.g. driving distances among a set of Indian cities , the proximity matrix tries to estimate how close two cities are on an average in comparison to rest of the cities within matrix.

Proximity values in the matrix represents the similarity among the object of the matrix; higher the value higher the similarity among the objects.

Analogy from Correlation Matrix: An analogy can be taken from correlation matrix. Just like higher value within correlation matrix represents high correlation among variables of regression; similarly high values in proximity matrix represents the similarity in the behaviors of the two object involved.

Symmetric Matrix: Proximity matrices are normally symmetric, so that the proximity of object a to object b is the same as the proximity of object b to object a. Upper half of the matrix would be mirror image of lower half.

Proximity in Random Forest: Proximities are calculated for each pair of cases/observations/sample points. If two cases occupy the same terminal node through one tree, their proximity is increased by one. At the end of the run of all trees, the proximities are normalized by dividing by the number of trees. Proximities are used in replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.

Let’s say that you have N observations (i.e. rows of data). For every pair of observations, the proximity measure tells you the percentage of time they end up in the same leaf node. For example, if your random forest consisted of 100 trees, and a pair of observations end up in the same leaf node in 80 of the 100 trees. Then the proximity measure is 80/100 = 0.8. The higher the proximity measure, the more similar the pair of observations.