Saturday, July 15, 2017

Published July 15, 2017 by theboat with 39 comments

dataisbeautiful?

https://www.reddit.com/r/dataisbeautiful/ is a subreddit where people post graphs of various things. I was curious which languages/tools are most popular there, so I tried a crude estimation and it worked out reasonably well. Here are the spoilers before I start:

Algorithm

There's a lot of content there, so I can't read through every post. Thus, I need an automated way to get it. Luckily, there's a Python package called PRAW that makes it easier to mine data on reddit. For the mining then, I used the following algorithm:

Get all posts for a day
Filter those posts by [OC] posts as those are most likely to have specified the tools used
Filter the comments of those posts by comments made by the post author since those are the ones that might specify the language
Search those comments for the list of languages that seem most likely and track which ones are found
Repeat for the next day in the list

I also did a few other things (track score of submission, track date of submission, etc.) for additional analyses. This algorithm isn't perfect: it can miss posts where a language was specified in a way that doesn't trip my filters (e.g., a typo would kill this), and many posts don't specify languages in the first place. It should work as a pretty good estimator though.
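The filtering steps above can be sketched roughly as follows. This is not the code from my repository, just a minimal illustration that works on plain dictionaries instead of live PRAW objects. The field names (`title`, `author`, `comments`, `body`), the function names, and the tool list are all assumptions made for the example.

```python
import re

# Hypothetical tool list; the exact list used for the post is not given here.
LANGUAGES = ["excel", "python", "matlab", "tableau", "r"]


def find_languages(text, languages=LANGUAGES):
    """Return the set of languages mentioned as whole words in text."""
    found = set()
    for lang in languages:
        # Word boundaries keep "r" from matching inside words like "chart".
        if re.search(r"\b" + re.escape(lang) + r"\b", text, flags=re.IGNORECASE):
            found.add(lang)
    return found


def scan_posts(posts, languages=LANGUAGES):
    """Map post id -> languages found in the author's own comments on [OC] posts."""
    results = {}
    for post in posts:
        if "[oc]" not in post["title"].lower():
            continue  # keep only [OC] posts
        found = set()
        for comment in post["comments"]:
            if comment["author"] == post["author"]:  # author's comments only
                found |= find_languages(comment["body"], languages)
        results[post["id"]] = found
    return results


# Made-up sample data standing in for one day of posts.
posts = [
    {"id": "a1", "title": "Rainfall by month [OC]", "author": "alice",
     "comments": [{"author": "alice", "body": "Made with Python and matplotlib."},
                  {"author": "bob", "body": "Should have used Excel."}]},
    {"id": "b2", "title": "A chart I found", "author": "carol",
     "comments": [{"author": "carol", "body": "Source: R"}]},
]
print(scan_posts(posts))  # {'a1': {'python'}}
```

Note that bob's mention of Excel is ignored because he isn't the post author, and the second post is skipped because it isn't flagged [OC].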

Most used language

First, we need to know how many posts identified languages used. I ran this against the last two years of data, and found that there were 7845 posts flagged [OC], and the algorithm above identified the language(s) used in 4189 of them.

Next, we can do a simple comparison of language usage vs posts that specified languages, and that's where you get the plot at the beginning (% = 100*(posts specifying this language/posts specifying any language)):

Note that the numbers above add up to >100% because some posts specified multiple languages (511 of the 4189 posts with identifiable languages).
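To make the arithmetic concrete, here is a small sketch of the percentage formula. The three totals come from the numbers above; the per-post sets in the example call are made up, and `language_shares` is a hypothetical helper name.

```python
from collections import Counter

oc_posts = 7845    # posts flagged [OC]
identified = 4189  # posts where at least one language was found
multi = 511        # posts that named more than one language

print(f"{100 * identified / oc_posts:.1f}% of [OC] posts named a tool")  # 53.4%
print(f"{100 * multi / identified:.1f}% of those named more than one")   # 12.2%


def language_shares(found_per_post):
    """% = 100 * (posts specifying this language / posts specifying any language)."""
    with_language = [found for found in found_per_post if found]
    counts = Counter(lang for found in with_language for lang in found)
    return {lang: 100 * n / len(with_language) for lang, n in counts.items()}


# Made-up example: the shares sum to more than 100 because one post names two tools.
shares = language_shares([{"excel"}, {"excel", "python"}, {"r"}, set()])
print(shares)
```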

And that's the original goal. It's clear that Excel wins by a landslide. I guess it makes sense because almost everyone can use Excel and it's really quick to get plots out. Python dominating MATLAB surprised me at first but makes sense in retrospect since MATLAB is not free and has fewer users (it's just really great for working with data).

Most valuable language

To make it interesting, I wanted to see if any languages predicted more success on reddit. I tried doing that a few different ways. A simple one is to get the average score per post per language:

That looks odd. We can't assume post scores have a normal distribution though, so another test is using medians:
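As a quick illustration of why the two can disagree so much, here is a made-up list of scores with one viral post in it (these are not numbers from my data):

```python
import statistics

# Hypothetical scores: mostly small, plus one post that took off.
scores = [0, 1, 2, 3, 5, 8, 12, 20, 4500]

print(statistics.mean(scores))    # about 505.7, dragged up by the single 4500
print(statistics.median(scores))  # 5
```

A single high-scoring post moves the average by two orders of magnitude while the median barely notices it.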

That's a huge disparity between median and average. How weird is the distribution? A histogram with logarithmic bins yields:
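Logarithmic binning can be sketched like this, assuming integer scores. The bin labels and function names are my own choices for the example, and scores of 0 get their own bin since log(0) is undefined.

```python
from collections import Counter


def log_bin(score):
    """Return the decade bin (1-9, 10-99, 100-999, ...) an integer score falls in."""
    if score <= 0:
        return "0"
    decade = len(str(score)) - 1  # digits minus one equals floor(log10(score))
    low = 10 ** decade
    return f"{low}-{low * 10 - 1}"


def log_histogram(scores):
    """Count how many scores land in each decade bin."""
    return Counter(log_bin(s) for s in scores)


# Made-up scores for illustration.
print(log_histogram([0, 3, 15, 99, 4500]))
```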

That is much clearer to me. One interesting thing is that it spikes up in the 3 to 10 thousand score range, so I'm guessing that's when a post makes it to the front page maybe? An idea then is to look at the score distributions by language:

It's pretty clear from this that Excel is more bottom heavy than some of the others. A huge number of posts with a score of 0 used it, and it has very few posts with extremely high scores, especially considering that it is the most popular language/tool for this. It looks like MATLAB and Adobe tools have the highest percentage of high-scoring posts, but they have so few samples that it's hard to know. Among the popular languages/tools, Python and R appear to do best.

A final way to answer what languages/tools are most likely to yield a high score is to see what percentage of posts using the language/tool yield a score above 100:
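One way to compute that percentage is sketched below. It assumes each post is a (score, set of languages) pair, which is a layout I picked for the example, and the sample data is made up.

```python
from collections import defaultdict


def share_above(posts, threshold=100):
    """Percentage of posts per language with a score strictly above threshold."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for score, languages in posts:
        for lang in languages:
            totals[lang] += 1
            if score > threshold:
                high[lang] += 1
    return {lang: 100 * high[lang] / totals[lang] for lang in totals}


# Made-up posts; a post naming two tools counts once for each.
sample = [
    (5, {"excel"}),
    (250, {"excel"}),
    (100, {"excel"}),   # exactly 100 does not count as "above 100"
    (4500, {"python"}),
    (40, {"python", "excel"}),
]
print(share_above(sample))
```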

This just reinforces the takeaways from the histograms (it's basically the same information in a different form) and I'll stop there...

Probable biases in this data

I would guess that the following occurred to some degree:

some languages are probably more prone to typos...e.g., maybe a lot of people typed 'Tablaeu' instead of 'Tableau'...if that's the case, those languages would be undercounted by my crude algorithm
a lot of OC posts don't specify the language(s) used and there might be a bias there...I wouldn't be shocked for example if a larger percentage of those actually used Excel or Tableau than something like MATLAB
a lot of people probably specify something like 'plotly' as the tool used, which makes the actual language ambiguous even though it definitely wasn't Excel in that case
I personally submit a lot of posts using MATLAB. I think roughly 10% of the MATLAB posts are mine, and I usually submit low-quality posts that get very few upvotes (I don't think I've ever broken a score of 100 on this subreddit). Thus, I have personally hurt MATLAB's performance.
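For the typo problem, one possible approach (not something I used for this post) is fuzzy matching with Python's standard `difflib` module. The tool list, the function name, and the 0.85 cutoff below are assumptions for the sketch. The cutoff in particular would need tuning, and very short names like 'R' would still need exact matching.

```python
import difflib
import re

# Hypothetical list of longer tool names; short ones like "r" are left out on purpose.
KNOWN_TOOLS = ["excel", "python", "matlab", "tableau"]


def fuzzy_find(text, tools=KNOWN_TOOLS, cutoff=0.85):
    """Return tools whose names closely match some word in text."""
    found = set()
    for word in re.findall(r"[a-z0-9+#]+", text.lower()):
        # get_close_matches returns at most n candidates scoring >= cutoff.
        match = difflib.get_close_matches(word, tools, n=1, cutoff=cutoff)
        if match:
            found.add(match[0])
    return found


print(fuzzy_find("Made in Tablaeu"))  # {'tableau'}
print(fuzzy_find("see the table"))    # set()
```

With a looser cutoff such as 0.8, the ordinary word 'table' would start matching 'tableau', so there is a real trade-off between catching typos and creating false positives.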

I'll think about more robust ways to catch all of these and might do this again at some point in the future. As a note, I did the data gathering and plotting in Python, but Excel doesn't have as many top posts so I redid all of the plots using Excel in hopes of breaking the trend.

My code for scanning the posts can be found here: https://github.com/rhamner/dataisbeautiful_languageFrequency

