Saturday, July 15, 2017


What Languages Are Most Popular On /r/dataisbeautiful?

https://www.reddit.com/r/dataisbeautiful/ is a subreddit where people post graphs of various things. I was curious which languages/tools are most popular there, so I tried a crude estimate, and it worked out reasonably well. Here are the spoilers before I start:


Algorithm

There's a lot of content there, so I can't read through every post. Thus, I need an automated way to get it. Luckily, there's a Python package called PRAW that makes it easier to mine data on reddit. For the mining, I used the following algorithm:
  • Get all posts for a day
  • Filter those posts by [OC] posts as those are most likely to have specified the tools used
  • Filter the comments of those posts by comments made by the post author since those are the ones that might specify the language
  • Search those comments for the list of languages that seem most likely and track which ones are found
  • Repeat for the next day in the list
I also did a few other things (track score of submission, track date of submission, etc.) for additional analyses. This algorithm isn't perfect: it misses posts where a language was specified in a way that doesn't trip my filters (e.g., a typo would kill this), and many posts don't specify languages in the first place. It should work as a pretty good estimator though.
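The filtering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from my repository: the `Post` and `Comment` classes are hypothetical stand-ins for what PRAW would return, the `LANGUAGES` list is a made-up subset, and the sample posts are invented so the sketch runs without touching reddit.

```python
import re
from dataclasses import dataclass, field

# Hypothetical candidate list; the real search list is not reproduced here.
LANGUAGES = ["excel", "python", "r", "matlab", "tableau"]


@dataclass
class Comment:
    author: str
    body: str


@dataclass
class Post:
    title: str
    author: str
    score: int
    comments: list = field(default_factory=list)


def is_oc(post):
    """True if the post title carries the [OC] flag."""
    return "[oc]" in post.title.lower()


def languages_in(text, languages=LANGUAGES):
    """Return the candidate names that appear in text as whole words."""
    found = set()
    for lang in languages:
        pattern = r"\b" + re.escape(lang) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(lang)
    return found


def scan_posts(posts):
    """Map each [OC] post title to the languages named in the author's own comments."""
    results = {}
    for post in posts:
        if not is_oc(post):
            continue
        found = set()
        for comment in post.comments:
            if comment.author == post.author:
                found |= languages_in(comment.body)
        results[post.title] = found
    return results


# Invented sample data for illustration only.
posts = [
    Post("[OC] Rainfall by city", "alice", 120,
         [Comment("alice", "Made with Python and matplotlib"),
          Comment("bob", "Should have used Excel")]),
    Post("Interesting chart I found", "carol", 50,
         [Comment("carol", "Made in Tableau")]),
    Post("[OC] My commute times", "dave", 3,
         [Comment("dave", "Data from my phone")]),
]
results = scan_posts(posts)
```

Note that a whole-word match on a one-letter name like R is fragile: it would also fire on the "r" in a link such as /r/dataisbeautiful, so a real scan needs extra care there.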

Most used language

First, we need to know how many posts identified the languages used. I ran this against the last two years of data and found 7845 posts flagged [OC]; the algorithm above identified the language(s) used in 4189 of them (about 53%).

Next, we can do a simple comparison of language usage vs. posts that specified languages, and that's where you get the plot at the beginning (% = 100*(posts specifying this language/posts specifying any language)):



Note that the numbers above add up to >100% because some posts specified multiple languages (511 of the 4189 posts with identifiable languages).
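As a small worked example of the percentage formula and of why the totals exceed 100%, here is a sketch on made-up data. The function name `usage_share` and the sample sets are hypothetical; only the 511 and 4189 figures come from the actual scan.

```python
from collections import Counter


def usage_share(post_languages):
    """Percent of language-identified posts that mention each language.

    post_languages: one set of language names per post; empty sets are
    posts where no language was identified and are left out of the total.
    """
    identified = [langs for langs in post_languages if langs]
    total = len(identified)
    if total == 0:
        return {}
    counts = Counter(lang for langs in identified for lang in langs)
    return {lang: 100 * n / total for lang, n in counts.items()}


# Made-up data: the second post names two tools, so it is counted twice.
sample = [{"excel"}, {"python", "excel"}, {"r"}, {"python"}, set()]
shares = usage_share(sample)
total_percent = sum(shares.values())  # 125.0 here, i.e. more than 100

# Share of identified posts that named more than one language in the real scan.
multi_language_percent = 100 * 511 / 4189  # roughly 12.2
```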

And that's the original goal. It's clear that Excel wins by a landslide. I guess it makes sense because almost everyone can use Excel and it's really quick to get plots out. Python dominating MATLAB surprised me at first but makes sense in retrospect since MATLAB is not free and has fewer users (it's just really great for working with data).

Most valuable language

To make it interesting, I wanted to see if any languages predicted more success on reddit. I tried doing that a few different ways. A simple one is to get the average score per post per language:


That looks odd. We can't assume post scores have a normal distribution though, so another test is using medians:
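To see why the mean and the median can disagree so badly, here is a sketch with invented scores (not my actual data): one runaway post is enough to drag a language's average far above its typical score.

```python
from statistics import mean, median


def summarize(scores_by_language):
    """Return {language: (mean, median)} for every language with at least one score."""
    return {
        lang: (mean(scores), median(scores))
        for lang, scores in scores_by_language.items()
        if scores
    }


# Invented scores: a single 5000-point post dominates the Excel average.
sample_scores = {
    "excel": [0, 1, 2, 3, 5000],
    "python": [4, 10, 20, 60, 200],
}
summary = summarize(sample_scores)
excel_mean, excel_median = summary["excel"]
python_mean, python_median = summary["python"]
```

In this toy data Excel wins on the mean but loses on the median, which is exactly the kind of reversal a skewed distribution produces.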



That's a huge disparity between median and average. How weird is the distribution? A histogram with logarithmic bins yields:
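A histogram with logarithmic bins can be sketched like this. It assumes integer scores and uses whole decades (1-9, 10-99, ...) as bins, which is coarser than what I plotted; scores of zero or less get their own bin because a log scale has no place for them. The function names and sample scores are made up.

```python
from collections import Counter


def decade_bin(score):
    """Label the power-of-ten bin an integer score falls into."""
    if score <= 0:
        return "0 or less"
    exponent = len(str(score)) - 1  # number of digits minus one
    low = 10 ** exponent
    high = 10 ** (exponent + 1) - 1
    return f"{low}-{high}"


def log_histogram(scores):
    """Count how many scores land in each decade bin."""
    return Counter(decade_bin(score) for score in scores)


# Invented scores for illustration.
sample_scores = [0, 1, 7, 12, 85, 130, 999, 1000, 4500, 9800]
histogram = log_histogram(sample_scores)
```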



That is much clearer to me. One interesting thing is that it spikes up in the 3 to 10 thousand score range, so my guess is that's where posts that make it to the front page end up. An idea then is to look at the score distributions by language:








It's pretty clear from this that Excel is more bottom heavy than some of the others. A huge number of posts with a score of 0 used it, and it has very few posts with extremely high scores, especially considering that it is the most popular language/tool for this. It looks like MATLAB and Adobe tools have the highest percentage of high-scoring posts, but they have so few samples that it's hard to know. Among the popular languages/tools, Python and R appear to do best.

A final way to answer what languages/tools are most likely to yield a high score is to see what percentage of posts using the language/tool yield a score above 100:



This just reinforces the takeaways from the histograms (it's basically the same information in a different form) and I'll stop there...
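For reference, the "percentage of posts scoring above 100" measure can be sketched as below. The function name and the scores are invented for illustration; note that a score of exactly 100 does not count as above the threshold.

```python
def share_above(scores_by_language, threshold=100):
    """Percent of each language's posts with a score strictly above threshold."""
    return {
        lang: 100 * sum(1 for score in scores if score > threshold) / len(scores)
        for lang, scores in scores_by_language.items()
        if scores
    }


# Invented scores for illustration.
sample_scores = {
    "excel": [0, 5, 40, 150],
    "python": [20, 100, 101, 300],
}
above_100 = share_above(sample_scores)
```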

Probable biases in this data

I would guess that the following occurred to some degree:
  • some languages are probably more prone to typos...e.g., maybe a lot of people typed 'Tablaeu' instead of 'Tableau'...if that's the case, those languages would be undercounted by my crude algorithm
  • a lot of OC posts don't specify the language(s) used and there might be a bias there...I wouldn't be shocked for example if a larger percentage of those actually used Excel or Tableau than something like MATLAB
  • a lot of people probably specify something like 'plotly' as the tool used, which makes the actual language ambiguous even though it definitely wasn't Excel in that case
  • I personally submit a lot of posts using MATLAB. I think roughly 10% of the MATLAB posts are mine, and I usually submit low-quality posts that get very few upvotes (I don't think I've ever broken a score of 100 on this subreddit). Thus, I have personally hurt MATLAB's performance.
I'll think about more robust ways to catch all of these and might do this again at some point in the future. As a note, I did the data gathering and plotting in Python, but Excel doesn't have as many top posts so I redid all of the plots using Excel in hopes of breaking the trend.
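One possible way to catch the typo case is fuzzy matching. The sketch below is an idea I have not run against the real data; it uses `difflib.get_close_matches` from Python's standard library, and the `LANGUAGES` list and the 0.8 cutoff are arbitrary assumptions. One-letter names like R are left out on purpose, since fuzzy matching on a single character is meaningless.

```python
import difflib
import re

# Hypothetical candidate list; R is excluded because it is too short to fuzzy match.
LANGUAGES = ["excel", "python", "matlab", "tableau"]


def fuzzy_languages(text, languages=LANGUAGES, cutoff=0.8):
    """Return candidate names that closely match any word in text."""
    found = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        matches = difflib.get_close_matches(word, languages, n=1, cutoff=cutoff)
        if matches:
            found.add(matches[0])
    return found


typo_hit = fuzzy_languages("Made in Tablaeu")
exact_hit = fuzzy_languages("Plotted with MATLAB")
```

A lower cutoff catches more typos but also starts matching unrelated words, so the threshold would need tuning against real comments.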

My code for scanning the posts can be found here: https://github.com/rhamner/dataisbeautiful_languageFrequency

