The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity
5 Pages. Posted: 26 Oct 2012. Last revised: 12 Nov 2012.
Date Written: October 25, 2012
We summarize the statistical properties of statistics computed from independent random bitstreams, including the commonly discussed support and cosine similarity. We derive the moments of the asymptotically normal approximation to the sampling distribution of the cosine similarity of independent random bitstreams and compare those computed moments with those measured by Monte-Carlo simulation. We find agreement for bitstreams of internet-scale length (i.e. of order 10,000 bits) and for much shorter ones (100 and 10 bits), and demonstrate that the expected value of the cosine similarity of independent bitstreams may be very significantly distant from zero. To compensate for this bias we propose a new statistic, Support Adjusted Cosine Similarity (SACS).
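The claim that independent bitstreams can have a cosine similarity far from zero can be illustrated with a small Monte-Carlo experiment. The sketch below is not the paper's code: the function names, the Bernoulli bit probabilities `p` and `q`, and the convention of returning 0 when a stream has no set bits are all assumptions made for illustration. It compares the simulated mean with the first-order (delta method) approximation sqrt(p*q), which may differ from the moments derived in the paper.

```python
import math
import random


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length 0/1 sequences."""
    dot = sum(a & b for a, b in zip(x, y))  # number of co-occurring 1s
    nx, ny = sum(x), sum(y)                 # support (count of 1s) of each stream
    if nx == 0 or ny == 0:
        return 0.0                          # assumed convention for empty streams
    # For binary vectors the squared Euclidean norm equals the support.
    return dot / math.sqrt(nx * ny)


def simulate_mean_cosine(n_bits, p, q, n_trials, seed=0):
    """Average cosine similarity over independent Bernoulli(p), Bernoulli(q) streams."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = [1 if rng.random() < p else 0 for _ in range(n_bits)]
        y = [1 if rng.random() < q else 0 for _ in range(n_bits)]
        total += cosine_similarity(x, y)
    return total / n_trials


def first_order_mean(p, q):
    """First-order approximation: E[dot] / sqrt(E[nx] * E[ny]) = sqrt(p * q)."""
    return math.sqrt(p * q)


if __name__ == "__main__":
    # Hypothetical parameters: stream lengths matching those named in the abstract.
    for n_bits in (10, 100, 10_000):
        simulated = simulate_mean_cosine(n_bits, p=0.5, q=0.5, n_trials=50)
        print(n_bits, round(simulated, 4), round(first_order_mean(0.5, 0.5), 4))
```

With `p = q = 0.5` the approximation gives 0.5 rather than 0, even though the two streams are independent by construction; the simulated mean should approach this value as the streams get longer.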
Keywords: collaborative filtering, cosine similarity, random bitstreams, sampling distribution, support, nested binomial distribution, Monte-Carlo simulation, delta method