80/20 Rule in Data Science


The 80/20 rule, also known as the Pareto principle, is a widely used rule of thumb in data science. It states that roughly 80% of effects come from 20% of causes. The exact split varies from case to case; the point is that a small number of factors often accounts for most of the outcome.

One example of the 80/20 rule in data science is predictive modeling. When building a predictive model, it is common to find that a small subset of the features (around 20%) accounts for most of the model's predictive power (around 80%). It is therefore often unnecessary to use every available feature: a more parsimonious model built from only the most important features can frequently perform nearly as well.
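As a rough illustration of the feature-selection idea above, the sketch below picks the fewest features whose combined importance reaches 80% of the total. The function name `select_top_features`, the feature names, and the importance scores are all hypothetical; in practice the scores would come from whatever importance measure the model provides.

```python
def select_top_features(importances, threshold=0.8):
    """Return the fewest top-ranked features whose combined importance
    reaches `threshold` (a fraction) of the total importance."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    # Rank features from most to least important.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, score in ranked:
        selected.append(name)
        cumulative += score
        if cumulative / total >= threshold:
            break
    return selected


# Hypothetical importance scores for ten made-up features.
example_importances = {
    "tenure": 0.45, "monthly_spend": 0.37, "support_calls": 0.05,
    "region": 0.04, "age": 0.03, "plan_type": 0.02, "device": 0.02,
    "signup_channel": 0.01, "newsletter": 0.005, "referral": 0.005,
}
top_features = select_top_features(example_importances)
print(top_features, f"({len(top_features)} of {len(example_importances)} features)")
```

With these made-up scores, two of the ten features already cover more than 80% of the total importance. Real models rarely split this cleanly, so the reduced model should still be validated against the full one.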

Another example of the 80/20 rule in data science is the analysis of customer data. In many businesses, a small share of customers (around 20%) accounts for most of the company's revenue (around 80%). When this holds, it is often more effective for the company to prioritize retaining and expanding business with these high-value customers than to concentrate solely on attracting new ones.
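Whether a customer base actually follows this pattern is easy to check. The sketch below computes the share of revenue that comes from the top 20% of customers; the function name `top_customer_revenue_share`, the customer IDs, and the revenue figures are invented for illustration.

```python
def top_customer_revenue_share(revenue_by_customer, top_fraction=0.2):
    """Return the fraction of total revenue generated by the top
    `top_fraction` of customers, ranked by revenue."""
    if not revenue_by_customer:
        raise ValueError("need at least one customer")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(revenue_by_customer.values(), reverse=True)
    total = sum(ranked)
    if total <= 0:
        raise ValueError("total revenue must be positive")
    # Number of customers in the top group, rounded, but at least one.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / total


# Hypothetical revenue per customer.
example_revenue = {
    "C01": 5000, "C02": 3000, "C03": 400, "C04": 350, "C05": 300,
    "C06": 250, "C07": 250, "C08": 200, "C09": 150, "C10": 100,
}
share = top_customer_revenue_share(example_revenue)
print(f"Top 20% of customers generate {share:.0%} of revenue")
```

The example data is constructed so that the top two of ten customers generate exactly 80% of revenue. On real data the result may be closer to 60/20 or 90/20, which is worth knowing before deciding how to allocate marketing effort.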

Additional examples of the pattern (the figures are approximate and will differ between datasets):

  1. In a dataset of customer purchases, 80% of the revenue is generated by 20% of the customers.
  2. In a dataset of web traffic, 80% of the pageviews come from 20% of the pages on the website.
  3. In a dataset of employee performance, 80% of the productivity is contributed by 20% of the employees.
  4. In a dataset of customer feedback, 80% of the complaints are related to 20% of the products.
  5. In a dataset of social media followers, 80% of the engagement is generated by 20% of the followers.
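Patterns like those in the list can be checked by building a simple Pareto table from raw records: count the records per category, sort by count, and track the cumulative share. The sketch below does this for a made-up complaint log; `pareto_table`, the product names, and the counts are assumptions for illustration only.

```python
from collections import Counter


def pareto_table(records):
    """Count records per category and return rows of
    (category, count, cumulative_share), largest category first."""
    counts = Counter(records)
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, running / total))
    return rows


# Hypothetical complaint log: one product name per complaint.
complaint_counts = {
    "router": 50, "modem": 30, "cable": 5, "adapter": 4, "switch": 3,
    "hub": 3, "antenna": 2, "extender": 1, "splitter": 1, "dock": 1,
}
complaints = [
    product for product, n in complaint_counts.items() for _ in range(n)
]

for product, count, cumulative in pareto_table(complaints):
    print(f"{product:<10}{count:>4}{cumulative:>8.0%}")
```

Reading down the cumulative column shows how many categories are needed to reach a given share of the total. In this constructed example the first two of ten products account for 80% of complaints.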

Overall, the 80/20 rule is a useful heuristic in data science. It helps researchers and analysts identify the factors that matter most in a given situation and focus their efforts where they will have the greatest impact.