I’ve recently started to play around with HN data. Here are a few interesting things I found while getting to know the dataset.
The Data
You can read about the data here.
At the time of writing, there are about 31,150,000 entries. Every single comment, story or other type of post has an index, starting with…
The First Entry
No need to go to the raw data for this. You can simply open it using https://news.ycombinator.com/item?id=1.
It’s a humble 57-point submission of the ycombinator.com site, with a handful of comments below.
The First 10 Years
For this post, I decided to only look at the first 10-ish years of HN: all entries from 2006 until the end of 2016.
By the end of 2016, HN was up to 13,267,000-ish entries (timezones and the arbitrary time range make more precision pointless).
Hardly anything happened in 2006, but by 2016 there were already enough interactions for the last entry of the year to be made in the very last second of the selected timespan.
Let’s look at how the number of submissions developed over the years:
This graph shows all entries - comments, stories and other types. We can see something which looks like steady growth. One thing which jumps out is the drop in entries in 2014 - I wonder what happened there.
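For the curious, here is a minimal sketch of how per-year totals like these could be computed. It assumes each entry is a dict with a Unix `time` field in seconds, as in the official HN API; the sample rows and the helper names `year_of` and `entries_per_year` are made up for illustration. Bucketing in UTC is an arbitrary choice, which is exactly why the totals above are only "-ish".

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample rows; the real dataset has millions of these.
entries = [
    {"id": 1, "type": "story", "time": 1160418111},    # October 2006
    {"id": 2, "type": "comment", "time": 1199145600},  # 2008-01-01 00:00:00 UTC
    {"id": 3, "type": "comment", "time": 1483228799},  # 2016-12-31 23:59:59 UTC
    {"id": 4, "type": "story", "time": 1483228800},    # 2017-01-01 00:00:00 UTC
]


def year_of(entry):
    """Return the calendar year (in UTC) in which the entry was created."""
    return datetime.fromtimestamp(entry["time"], tz=timezone.utc).year


def entries_per_year(rows, first_year=2006, last_year=2016):
    """Count entries per year, keeping only the selected time range."""
    years = (year_of(row) for row in rows)
    return Counter(y for y in years if first_year <= y <= last_year)


counts = entries_per_year(entries)
for year in sorted(counts):
    print(year, counts[year])
```

Note how the last sample row falls one second outside the range and is dropped; with a different timezone it would have counted towards 2016.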
Let’s narrow the data and only look at stories:
A caveat here: to my current knowledge, we are not filtering out “dead” stories - ones which were effectively blocked and not visible to most users.
We see that there were far fewer stories than entries overall. The pattern is similar to the overall activity above. However, the jump in 2011 and 2012 is interesting. Could it suggest that HN had some spam issues in those years? It might be worth looking into.
One more way to look at the data! How many non-comment, non-story submissions were there?
According to the docs (and data), the type of any entry is one of 'comment', 'job', 'poll', 'pollopt', 'story'. So we’re looking at jobs and poll-stuff here. As pollopts are included, the numbers are probably inflated compared to what they would be if those were filtered out.
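To get a feel for how much the poll options inflate the numbers, the count can be done both ways. This is only a sketch: it assumes every entry carries a `type` field with one of the five values listed above, and `count_other_types` and the sample rows are hypothetical.

```python
from collections import Counter

# Everything that is neither a comment nor a story.
OTHER_TYPES = {"job", "poll", "pollopt"}


def count_other_types(rows, include_pollopts=True):
    """Count non-comment, non-story entries per type."""
    wanted = OTHER_TYPES if include_pollopts else OTHER_TYPES - {"pollopt"}
    return Counter(row["type"] for row in rows if row["type"] in wanted)


# Made-up sample: one job, one poll with three options, plus noise.
sample = [
    {"id": 10, "type": "job"},
    {"id": 11, "type": "poll"},
    {"id": 12, "type": "pollopt"},
    {"id": 13, "type": "pollopt"},
    {"id": 14, "type": "pollopt"},
    {"id": 15, "type": "story"},
    {"id": 16, "type": "comment"},
]

with_options = count_other_types(sample)
without_options = count_other_types(sample, include_pollopts=False)
print(sum(with_options.values()), sum(without_options.values()))
```

In this toy sample a single poll accounts for four of the five "other" entries, which is the kind of inflation to expect when pollopts are left in.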
Anyway, time for some sorted lists!
Stories With Most Comments
What are the stories which have received the most comments in total? Here’s the top 5 for the selected time range, prefixed with the total number of comments:
- 2634 comments @ UK votes to leave EU by dmmalam
- 2385 comments @ Donald Trump is the president-elect of the U.S. by introvertmac
- 1791 comments @ iPhone 7 by benigeri
- 1778 comments @ Donald Trump Is Elected President by koolba
- 1700 comments @ Please tell us what features you’d like in news.ycombinator by pg
Stories With Most Upvotes
- A score of 5771 @ A Message to Our Customers by epaga
- A score of 4338 @ Steve Jobs has passed away. by patricktomas
- A score of 3531 @ Show HN: This up votes itself by olalonde
- A score of 3125 @ UK votes to leave EU by dmmalam
- A score of 3086 @ Tim Cook Speaks Up by replicatorblog
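Both top-5 lists boil down to the same operation: pick the stories with the largest value of some field. Here is a rough sketch, assuming the entries expose `score` and `descendants` (the total comment count) the way the official HN API does. The helper `top_stories` and the sample rows are invented for illustration and are not real HN data.

```python
import heapq


def top_stories(rows, key, n=5):
    """Return the n stories with the largest value of the given field."""
    stories = (row for row in rows if row["type"] == "story")
    # Missing or null values are treated as 0 so they sort last.
    return heapq.nlargest(n, stories, key=lambda row: row.get(key) or 0)


# Made-up rows for illustration only.
sample = [
    {"type": "story", "title": "A", "by": "u1", "score": 120, "descendants": 40},
    {"type": "story", "title": "B", "by": "u2", "score": 900, "descendants": 15},
    {"type": "story", "title": "C", "by": "u3", "score": 300, "descendants": 310},
    {"type": "story", "title": "D", "by": "u4", "score": None, "descendants": 2},
    {"type": "comment", "by": "u5", "text": "not a story"},
]

for story in top_stories(sample, "descendants", n=3):
    print(f'- {story["descendants"]} comments @ {story["title"]} by {story["by"]}')

for story in top_stories(sample, "score", n=3):
    print(f'- A score of {story["score"]} @ {story["title"]} by {story["by"]}')
```

`heapq.nlargest` avoids sorting the whole dataset, which matters little for a toy list but helps with millions of rows.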
Let’s get sidetracked for a minute! There’s one interesting thing to be said about the score. Comments are stored in a list and, apart from deleted ones, you pretty much know when each of them happened. The number of upvotes/score/karma of a post, however, is only available as a “point in time” value. When looking at these old entries, we can be fairly sure that their score hasn’t changed for a while. However, we don’t know when the upvoting happened, nor whether the value might still change.
Also, keep in mind when comparing scores that earlier in the history of HN, there just weren’t that many upvotes to go around. I remember reading an article which compared that “devaluation” to inflation - where a post with 100 upvotes in the first year was way more popular than a post with 100 upvotes 10 years later. Come to think of it, the same probably holds for comments.
With all of that in mind - there’s room for interesting things to be discovered around adjusting the score/number of comments according to the number of visibly-active users on the site at the time.
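One possible way to do such an adjustment, sketched under heavy assumptions: treat every user who authored at least one entry in a given year as "visibly active", and divide a story's score by that year's count. This is just one of many conceivable normalizations, not an established metric. The field names `by`, `time` and `score` follow the official HN API, while the helpers and the sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def year_of(unix_seconds):
    """Calendar year (UTC) of a Unix timestamp in seconds."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).year


def active_users_per_year(rows):
    """Number of distinct authors who posted anything in each year."""
    users = defaultdict(set)
    for row in rows:
        if row.get("by"):  # deleted entries may have no author
            users[year_of(row["time"])].add(row["by"])
    return {year: len(names) for year, names in users.items()}


def adjusted_score(story, active):
    """Score per visibly-active user in the story's year (assumed metric)."""
    return story["score"] / active[year_of(story["time"])]


# Made-up rows: two active users in 2007, four in 2016.
rows = [
    {"type": "story", "by": "a", "time": 1180000000, "score": 100},
    {"type": "comment", "by": "b", "time": 1180000500},
    {"type": "story", "by": "c", "time": 1470000000, "score": 100},
    {"type": "comment", "by": "d", "time": 1470000100},
    {"type": "comment", "by": "e", "time": 1470000200},
    {"type": "comment", "by": "f", "time": 1470000300},
]

active = active_users_per_year(rows)
early, late = rows[0], rows[2]
print(adjusted_score(early, active), adjusted_score(late, active))
```

Both stories have a raw score of 100, but the 2007 one comes out twice as "popular" once the smaller audience is taken into account.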
We could also use comments to guesstimate for how long there has been activity around a submission, but all of that is something for another time.
Most Nested Comment Threads
Which stories have spawned discussions which went on longer than all others have? Here are three stories and the corresponding thread-tails:
- 48 deep @ Java is Pass-by-Value, Dammit -> last comment
- 44 deep @ Twitter Suspends Prominent Alt-Right Accounts -> last comment
- 43 deep @ On Two Views of Computation in Computer Science -> last comment
Usually clicking context on the last comment gives you a view of the whole conversation. That doesn’t seem to work for me in these cases.
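The nesting depth itself can be derived from the `parent` links alone. Below is a minimal sketch, assuming a mapping from each entry id to its parent id (with `None` for the story itself) and assuming that a top-level comment counts as depth 1. The ids, `depth_of` and `deepest_comment` are hypothetical.

```python
def depth_of(item_id, parent_of):
    """Number of parent links between an entry and the root story."""
    depth = 0
    current = item_id
    # Walk upwards until we reach an entry without a known parent.
    while parent_of.get(current) is not None:
        current = parent_of[current]
        depth += 1
    return depth


def deepest_comment(parent_of):
    """Return (id, depth) of the most deeply nested entry."""
    best_id = max(parent_of, key=lambda item_id: depth_of(item_id, parent_of))
    return best_id, depth_of(best_id, parent_of)


# Made-up thread: story 100 with a chain 101 -> 102 -> 103 and a sibling 104.
parent_of = {
    100: None,  # the story
    101: 100,   # top-level comment
    102: 101,
    103: 102,
    104: 100,   # another top-level comment
}

print(deepest_comment(parent_of))
```

Walking every chain from scratch is wasteful for millions of comments; caching the depth of already-visited parents would be the obvious next step.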
Longest Single Comments
- 43311 characters by pg
- 29701 characters by lionhearted
- 26534 characters by pitdesi
Two of those are copy-pastes of some text which wasn’t available at its original source anymore. The one by pg lists all the banned sites in 2009.
Users With Most Entries Authored
Formatted in an eye-pleasing (citation needed) code block for readability:
1. tptacek - 40183
2. jacquesm - 27374
3. DanBC - 17223
4. dragonwriter - 16060
5. dang - 14476
6. jrockway - 14169
7. mikeash - 13897
8. anigbrowl - 13847
9. rayiner - 13540
10. icebraining - 12870
11. rbanffy - 12860
12. davidw - 12838
13. sp332 - 12792
14. eru - 12745
15. brudgers - 12205
16. pjmlp - 12038
17. ChuckMcM - 11942
18. coldtea - 11913
19. stcredzero - 11815
20. yummyfajitas - 11286
21. TeMPOraL - 11096
22. tokenadult - 10928
23. pg - 10730
24. protomyth - 10565
25. sliverstorm - 10561
26. gaius - 10167
27. Tichy - 10066
28. JoeAltmaier - 10003
29. ColinWright - 9755
30. Retric - 9744
31. patio11 - 9494
Looks like a long-tail distribution. As it should be. The length of the list is directly influenced by me being a patio11 fan and wanting to see where he ended up on the list. This will be relevant to know for future inquiries.
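A ranking like this is essentially a frequency count over the author field. A small sketch, assuming each entry has a `by` field as in the official HN API; `top_authors` and the user names are invented.

```python
from collections import Counter


def top_authors(rows, n=31):
    """Return the n most prolific authors as (name, entry count) pairs."""
    # Entries without an author (e.g. deleted ones) are skipped.
    counts = Counter(row["by"] for row in rows if row.get("by"))
    return counts.most_common(n)


# Made-up sample rows.
sample = [
    {"by": "alice", "type": "comment"},
    {"by": "bob", "type": "story"},
    {"by": "alice", "type": "comment"},
    {"by": "carol", "type": "comment"},
    {"by": "alice", "type": "story"},
    {"by": "bob", "type": "comment"},
    {"type": "comment"},  # deleted, no author
]

for rank, (name, total) in enumerate(top_authors(sample, n=3), start=1):
    print(f"{rank}. {name} - {total}")
```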
It’s A Wrap!
At least for now. This first look at the data has already yielded a few possibly-interesting questions to follow, and a few facts about the dataset I didn’t know.
I hope it was entertaining to you! If you want, you can join the discussion on HN.
For the next article, I’d like to take a closer look at how some of the more prominent HN users have been interacting with the site over the years. Until then!