Before I share my observations, let me tell you how it all started.
I am building a site named piptrends.com to compare Python packages' downloads and their GitHub statistics. You might ask why. Whenever I research which Python package to use for a project, I have to check multiple places before finalising my choice. So I thought, why not put all of that in a single place, like npmtrends.com? That's how piptrends.com's journey started.
I built the site, but how would I get traffic and feedback? I thought of creating pages such as top packages and their comparisons. That is when the Pareto principle (80/20 rule) came to mind. So I put on my data science hat, started digging into the data I have about Python packages, and found some interesting things. Let me share them.
Disclaimer: These are my observations and they might be totally wrong, so please let me know in the comments section so that we can all learn from each other.
At the time of writing this article, there are around 375,800 Python packages on pypi.org. Add up the downloads of all of them and you get more than 600 million per day. Amazing, right?
If you go a bit deeper, your mind will be blown. Let's dig into it.
For a long time, boto3 has been at the top of the list by download count. It contributes almost 3% of total downloads. It is created by Amazon Web Services and everyone knows about it. Let me tell you one more interesting thing: most of the time, 3 of the top 5 packages are boto3, botocore and s3transfer, all three created by none other than Amazon Web Services. More than 5% of total downloads come from them.
You all know the Pareto principle (80/20 rule), and we often hear that the top 1% hold 50% of the world's net wealth, that the top 10% of adults hold 85% while the bottom 90% hold the remaining 15%, and that the top 30% of adults hold 97%. The disparity is far greater in Python land (the ecosystem). Let me share the numbers.
The top 100 packages, around 0.03% of all packages, account for 50% of downloads. The top 500 packages (around 0.13%) account for 81% of downloads, and the top 1000 packages (around 0.27%) account for 90%.
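The percentages above are simple ratios against the total package count. Here is a minimal sketch of that arithmetic, assuming the 375,800 figure quoted earlier. The function names are my own for illustration, and `toy_downloads` is made-up data, not real PyPI numbers.

```python
TOTAL_PACKAGES = 375_800  # approximate PyPI package count quoted in this article


def percent_of_packages(top_n, total=TOTAL_PACKAGES):
    """Return top_n as a percentage of all packages."""
    return 100 * top_n / total


def top_n_download_share(downloads, top_n):
    """Percentage of all downloads that go to the top_n most-downloaded packages."""
    ranked = sorted(downloads, reverse=True)
    return 100 * sum(ranked[:top_n]) / sum(ranked)


for top_n in (100, 500, 1000):
    print(f"top {top_n}: {percent_of_packages(top_n):.2f}% of all packages")

# Made-up download counts, only to show how a "top N" share is computed.
toy_downloads = [500, 300, 100, 50, 30, 20]
print(top_n_download_share(toy_downloads, 2))  # 80.0
```

With real per-package download counts in place of the toy list, the second function gives the kind of "top N hold X% of downloads" figure quoted above.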
Mind-blowing, right? 🙂
After reading these numbers, I got more and more interested, so I kept digging and analysed the data for the top 1000 packages. This data is publicly available on pypi.org, so I extracted it and started analysing it. Let me share my findings.
Microsoft has the most packages in the top 1000: 97, which is almost 10%. Together they account for around 3.4% of total downloads. Google takes the next spot with 47 packages, which contribute around 4.2% of downloads. There are several others, but interestingly Amazon has only 9 packages in the top 1000 and they contribute around 5.6% of downloads. Interesting, right?
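To make the contrast concrete, here is a small sketch that divides each publisher's share of downloads by its number of packages. The figures are the ones quoted above, and the helper name is my own.

```python
# (packages in the top 1000, % of total downloads), as quoted in this article
publishers = {
    "Microsoft": (97, 3.4),
    "Google": (47, 4.2),
    "Amazon": (9, 5.6),
}


def share_per_package(package_count, download_share):
    """Average share of total downloads contributed by one package."""
    return download_share / package_count


for name, (count, share) in publishers.items():
    per_package = share_per_package(count, share)
    print(f"{name}: {per_package:.3f}% of total downloads per package")
```

On these numbers an average Amazon package contributes roughly 18 times as much as an average Microsoft package.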
A few amazing people have created amazing packages. Sébastien Eustace (Poetry author) has 10 packages in the top 1000, Armin Ronacher (flask author) also has 10, Georg Brandl (pygments author) has 9 and Tom Christie (starlette, uvicorn, httpx author) has 8.
I have said a lot about downloads, so it's time to move on to other topics.
Most of the packages use the MIT, BSD or Apache 2.0 license, but some use different licenses, so please don't forget to check the license before using a package.
Most of the packages support Python 3.7 or later. There are a few exceptions, but if you are on Python 3.8 you are in a great place. A couple of packages still support Python 2 but not 3.0, 3.1, 3.2, 3.3 or 3.4.
The average age of the top packages is 7.5 years, and most are between 4 and 9 years old. The age distribution looks like this:
A good number of packages are more than 10 years old. Packages like numpy, sqlalchemy, pygments and matplotlib have been helping Python developers for more than 15 years. We should salute the creators and maintainers for making our lives better and for making Python and its ecosystem amazing. The oldest package in the top 1000 is pytz and the youngest is types-urllib3.
Developers love GitHub; everyone knows that. So no surprise here either: almost all packages use GitHub. Only 5 packages use Bitbucket and 2 use GitLab.
The top 3 web frameworks are flask, django and fastapi. flask's popularity is growing day by day; we can clearly see the trend here.
fastapi is 4 years old and django has been around for the last 12 years, but fastapi has almost caught up with django in downloads. Soon fastapi will take the second spot, as the download trend below suggests.
There are a couple of interesting patterns in the download trends. For example, scikit-learn's popularity is growing compared to tensorflow and pytorch, and pytorch is slowly closing the gap with tensorflow.
Look at any package's download graph and it resembles an analogue signal, rising and falling in a regular wave. Look closely and you will see the dips fall on weekends. It's obvious, right? Most developers don't work on weekends 😁
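One way to check the weekend effect is to compare average downloads on weekdays with those on Saturdays and Sundays. This is only a sketch: the function name is my own, and the daily counts below are made up for illustration, not real data for any package.

```python
from datetime import date, timedelta
from statistics import mean


def weekday_weekend_means(daily):
    """daily maps a date to a download count; returns (weekday_mean, weekend_mean)."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    weekday = [count for day, count in daily.items() if day.weekday() < 5]
    weekend = [count for day, count in daily.items() if day.weekday() >= 5]
    return mean(weekday), mean(weekend)


# Two made-up weeks starting on a Monday: 1000 per weekday, 600 per weekend day.
start = date(2022, 8, 1)
daily = {
    start + timedelta(days=i): (600 if i % 7 >= 5 else 1000)
    for i in range(14)
}

weekday_avg, weekend_avg = weekday_weekend_means(daily)
print(f"weekday average: {weekday_avg}, weekend average: {weekend_avg}")
```

If the weekend average is clearly lower than the weekday average for real data, that explains the wave shape in the graph.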
There are also a couple of gotchas we need to keep in mind.
1. There are 8 inactive packages in the top 1000. It means many of us don't even check whether a package is active when installing it. Whenever you decide which Python package to use, don't forget to check its Development Status as well.
2. There are 4 pre-alpha and 44 alpha stage packages. Please get an idea about a package before installing it and understand what you are really getting into.
3. There are 256 packages that have not been updated for more than 1 year. Of those, 128 have not been updated for more than 2 years, yet so many of us still use them.
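These checks can be scripted. Below is a minimal sketch, not necessarily how I gathered the numbers above, that reads a package's metadata from the PyPI JSON API (`https://pypi.org/pypi/<name>/json`) and pulls out the Development Status classifier and the age of the latest release's files. The function names are my own, and I am assuming the `info.classifiers` and `urls[].upload_time_iso_8601` fields are present in the response.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

STATUS_PREFIX = "Development Status :: "


def fetch_metadata(package):
    """Download a package's metadata from the PyPI JSON API (needs network access)."""
    with urlopen(f"https://pypi.org/pypi/{package}/json") as response:
        return json.load(response)


def development_status(metadata):
    """Return the Development Status classifier text, or None if the package sets none."""
    for classifier in metadata["info"].get("classifiers", []):
        if classifier.startswith(STATUS_PREFIX):
            return classifier[len(STATUS_PREFIX):]
    return None


def days_since_last_upload(metadata, now=None):
    """Days since the newest file of the latest release was uploaded, or None."""
    uploads = [
        # Replace the trailing "Z" so older Python versions can parse the timestamp.
        datetime.fromisoformat(f["upload_time_iso_8601"].replace("Z", "+00:00"))
        for f in metadata.get("urls", [])
    ]
    if not uploads:
        return None
    now = now or datetime.now(timezone.utc)
    return (now - max(uploads)).days
```

For example, `development_status(fetch_metadata("requests"))` returns the status string the maintainers declared, and a large value from `days_since_last_upload` is a hint to look closer before depending on the package.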
Just for fun, I created a word cloud from the package names.
You can see that azure and google stand out the most, followed by a few others such as python, pytest, flask and tensorflow.
Before using a package in production, please make sure it is a good one. Do a bit of research: read its description, check its development status, when its last release happened, how many developers use it, and GitHub stats like stars and open issues.
By the way, as I mentioned at the start of this article, I am building piptrends.com to collate everything about a package in a single place so that developers can make a better decision before finalising a package. Please let me know if you have any suggestions to make it a better product.