The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity

5 Pages. Posted: 26 Oct 2012. Last revised: 12 Nov 2012.

Graham L. Giller

JP Morgan Chase Bank NA

Date Written: October 25, 2012

Abstract

We summarize the statistical properties of quantities computed from independent random bitstreams, including the commonly discussed support and cosine similarity. We derive the moments of the asymptotically normal approximation to the sampling distribution of the cosine similarity of independent random bitstreams and compare those computed moments with those measured by Monte-Carlo simulation. We find agreement for bitstreams of internet scale in length (i.e., of order 10,000 bits) and for much shorter ones (100 and 10 bits), and we demonstrate that the expected value of the cosine similarity of independent bitstreams might be very significantly distant from zero. To compensate for this bias, we propose a new statistic, Support Adjusted Cosine Similarity (SACS).
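The observation that independent bitstreams can have an expected cosine similarity far from zero can be checked with a small Monte-Carlo experiment. The following is a minimal sketch, not the paper's own code. It assumes each bit is an independent Bernoulli draw, and the function names and the rate parameters `p` and `q` are hypothetical choices made for illustration. Under that assumption, a leading-order argument (the dot product is about n·p·q and the two norms are about sqrt(n·p) and sqrt(n·q)) suggests the mean cosine similarity should be close to sqrt(p·q) rather than zero. The paper's derived moments and the SACS statistic are not reproduced here.

```python
import math
import random


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length 0/1 sequences.

    Returns 0.0 when either sequence has no set bits (a convention
    assumed here, since the ratio is otherwise undefined).
    """
    dot = sum(a & b for a, b in zip(x, y))  # count of positions set in both
    nx, ny = sum(x), sum(y)                 # support counts of each stream
    if nx == 0 or ny == 0:
        return 0.0
    return dot / math.sqrt(nx * ny)


def random_bitstream(n, p, rng):
    """n independent bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def simulate_mean_cosine(n, p, q, trials, seed=0):
    """Average cosine similarity over `trials` pairs of independent streams."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    total = 0.0
    for _ in range(trials):
        x = random_bitstream(n, p, rng)
        y = random_bitstream(n, q, rng)
        total += cosine_similarity(x, y)
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        mean = simulate_mean_cosine(n, p=0.5, q=0.5, trials=200)
        print(f"n={n:>6}  simulated mean={mean:.4f}  sqrt(p*q)={math.sqrt(0.25):.4f}")
```

Running the script prints the simulated mean next to sqrt(p·q) for the three stream lengths mentioned in the abstract, which makes the size of the offset from zero visible without relying on any closed-form result.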

Keywords: collaborative filtering, cosine similarity, random bitstreams, sampling distribution, support, nested binomial distribution, Monte-Carlo simulation, delta method

Suggested Citation

Giller, Graham L., The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity (October 25, 2012). Available at SSRN: https://ssrn.com/abstract=2167044 or http://dx.doi.org/10.2139/ssrn.2167044

Graham L. Giller (Contact Author)

JP Morgan Chase Bank NA

383 Madison Avenue
New York, NY
United States
