Curse of Dimensionality: The Curse of Dimensionality refers to the difficulty of finding patterns in data that lives in a high-dimensional space. The more features (dimensions) we have, the more data points we need to identify patterns reliably.
The reason is that as the number of dimensions increases, the volume of the space grows so fast that the available data becomes sparse. Sparse data is a problem when we try to reach a statistically significant result: to keep the same sampling density, the amount of data needed grows exponentially with the number of dimensions.
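One way to see the sparsity described above is to ask how large a neighbourhood must be to contain a fixed share of the data. The sketch below assumes points spread uniformly over a unit hypercube (an idealisation, not something real data guarantees) and uses a hypothetical helper, edge_for_fraction, to compute the side length of a sub-cube that encloses a given fraction of the points.

```python
def edge_for_fraction(fraction, dims):
    """Side length of a sub-cube holding `fraction` of uniformly spread points.

    Assumes a unit hypercube: a sub-cube with side e has volume e ** dims,
    so we solve e ** dims = fraction for e.
    """
    return fraction ** (1.0 / dims)


# How wide must a "local" neighbourhood be to capture 10% of the data?
for dims in (1, 2, 3, 10, 100):
    print(f"{dims:>3} dims: edge = {edge_for_fraction(0.10, dims):.3f}")
```

Under this assumption, a neighbourhood holding 10% of the data spans 10% of the axis in 1 dimension but about 79% of every axis in 10 dimensions, so "nearby" points are no longer really nearby.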
Let us try to understand this with an easy example. Say we dropped a coin on a 100 meter line. It would not be difficult to find: we simply walk along the line, and finding the coin takes a few minutes. Now let's say we have to find the coin in a 100 m * 100 m field; we would certainly need a few hours. If we add another dimension, we have to find the coin in a cube with sides of 100 meters, and it might take a few days.
Similarly, as the number of dimensions increases, the mathematical computation not only becomes more complex but also takes more time.
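The coin example can be put into rough numbers. The sketch below assumes the search space is divided into 1 meter cells and that checking one cell takes one second; both figures are illustrative assumptions, not measurements, and search_seconds is a hypothetical helper.

```python
def search_seconds(side_m, dims, seconds_per_cell=1.0):
    """Worst-case time to check every 1 m cell of a line, square, or cube.

    The number of cells is side_m ** dims; seconds_per_cell is an assumed rate.
    """
    return (side_m ** dims) * seconds_per_cell


for dims, shape in ((1, "line"), (2, "field"), (3, "cube")):
    s = search_seconds(100, dims)
    print(f"{shape:>5}: {s:>9,.0f} s = {s / 3600:8.2f} h = {s / 86400:6.2f} days")
```

With these assumed rates the line takes under two minutes, the field a little under three hours, and the cube more than eleven days: each added dimension multiplies the work by another factor of 100.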
Problems with high dimensional data:
1. Increased processing time
2. Overfitting
3. Required data size increases exponentially
Principal Component Analysis (PCA) is one of the common methods used to reduce dimensionality. The idea behind PCA is to find the directions (principal components) that account for most of the variance in the data.
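To make the PCA idea concrete, here is a minimal sketch that finds only the first principal component, using the standard library alone. It computes the covariance matrix and runs power iteration to find the direction of greatest variance. The helper names and the synthetic 3-feature data set are assumptions made for illustration; in practice one would normally use an established numerical library rather than this hand-rolled version.

```python
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def first_component(cov, iterations=100):
    """Power iteration: returns the top eigenvector and its eigenvalue."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = matvec(cov, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return v, eigenvalue


# Synthetic data (assumed): feature 2 is about twice feature 1, feature 3 is noise.
random.seed(0)
rows = []
for _ in range(200):
    t = random.gauss(0, 1)
    rows.append([t, 2 * t + random.gauss(0, 0.01), random.gauss(0, 0.01)])

centered = center(rows)
cov = covariance(centered)
direction, variance = first_component(cov)

# Share of total variance captured by the single retained dimension.
explained_ratio = variance / sum(cov[i][i] for i in range(len(cov)))

# Project every 3-feature row onto the first component: 3 dimensions become 1.
scores = [sum(a * b for a, b in zip(row, direction)) for row in centered]
print(direction, explained_ratio)
```

Because the three synthetic features really vary along one direction, a single component should capture almost all of the variance here. Real data rarely compresses this cleanly, so the explained-variance ratio is what tells you how many components are worth keeping.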
Advantages of Dimensionality Reduction:
1. Computational Efficiency
2. Lower cost of collecting and storing data
3. Easier classification problems
4. Ease of interpretation
Here is another interesting example to understand “The Curse of Dimensionality”. Think of identifying a statistically significant result as catching an animal, and the dimensions of the data as the animal's range of movement. Suppose we are chasing an animal that can move in only 2 dimensions: left or right, forward or backward. Now suppose we are trying to catch a bird, which adds another dimension: it can not only move left or right, forward or backward, but also fly up or down, making it much harder to catch. Finally, imagine some mythical time-traveling beast that can move in 4 dimensions: left or right (x), forward or backward (y), up or down (z), past or future (t). We can see that the chase gets more difficult as the dimensions increase. So we can say that the more dimensions we work with, the less effective standard computational and statistical techniques become. This has repercussions that need serious workarounds when machines are dealing with Big Data.