Benford’s law, Zipf’s law, and the Pareto distribution

What's new

A remarkable phenomenon in probability theory is that of universality – that many seemingly unrelated probability distributions, which ostensibly involve large numbers of unknown parameters, can end up converging to a universal law that may only depend on a small handful of parameters. One of the most famous examples of the universality phenomenon is the central limit theorem; another rich source of examples comes from random matrix theory, which is one of the areas of my own research.

Analogous universality phenomena also show up in empirical distributions – the distributions of a statistic $latex {X}&fg=000000$ from a large population of “real-world” objects. Examples include Benford’s law, Zipf’s law, and the Pareto distribution (of which the Pareto principle or 80-20 law is a special case). These laws govern the asymptotic distribution of many statistics $latex {X}&fg=000000$ which

  • (i) take values as positive numbers;
  • (ii) range over many…

View original post 4,602 more words

Probabilistic Data Structures for Web Analytics and Data Mining

Highly Scalable Blog

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in the areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters.  Computation of more advanced metrics like a number of unique visitor or most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other…

View original post 4,351 more words

T4 Templates for Generating Views for EF4/EF5 Database/Model First apps

Code, the Universe and everything...

After we received a comment on the EF team blog asking why the templates for generating views for Edmx based apps we posted years ago were not updated for EF5 Rowan asked me if I it would be possible to update the template so that it works for EF5. I actually remember looking at this template before I shipped my templates generating views for Code First apps and one of the first things I noticed was that the way it handled versions was robust but not very flexible. Therefore, when we shipped EF5, the templates could not be used with apps using v3 of the EF artifacts (edmx). I thought a bit more about this when I was working on the templates for CodeFirst and concluded that robustness is not really very important here. If someone wants views they will have to provide a valid model (be it Code First…

View original post 435 more words