vsupalov

The First 10 Years Of HN In Data

I’ve recently started to play around with HN data. Here are a few interesting things I found while getting to know the dataset.

The Data

You can read about the data here.

At the time of writing, there are about 31,150,000 entries. Every single comment, story or other type of post has an index, starting with…

The First Entry

No need to go to the raw data for this. You can simply open it using https://news.ycombinator.com/item?id=1.

It’s a humble 57-point submission of the ycombinator.com site, with a handful of comments below.

The First 10 Years

For this post, I decided to only look at the first 10-ish years of HN: all entries from 2006 until the end of 2016.

By the end of 2016, HN was up to 13,267,000-ish entries (timezones and the arbitrary time range make more precision pointless).

Hardly anything happened in 2006, but by 2016 there were already enough interactions for the last entry of 2016 to be made in the very last second of the selected timespan.

Let’s look at how the number of submissions developed over the years:

All HN entries by year between 2006 and 2016.

This graph shows all entries - comments, stories and other types. We can see something which looks like steady growth. One thing that jumps out is the drop in entries in 2014 - I wonder what happened there.
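The counting behind a graph like this can be sketched in a few lines. This is not the code used for the post, just a minimal illustration which assumes every entry is a dict with a Unix timestamp in a `time` field (as in the public HN API item format). The function name and the sample entries are made up.

```python
from collections import Counter
from datetime import datetime, timezone

def entries_per_year(items, first_year=2006, last_year=2016):
    """Count entries per calendar year (UTC), keeping only the chosen range."""
    counts = Counter()
    for item in items:
        # Assumption: "time" holds seconds since the Unix epoch.
        year = datetime.fromtimestamp(item["time"], tz=timezone.utc).year
        if first_year <= year <= last_year:
            counts[year] += 1
    return dict(sorted(counts.items()))

# Tiny made-up sample, not real HN data.
sample_items = [
    {"id": 1, "type": "story", "time": 1160418111},    # 2006
    {"id": 2, "type": "comment", "time": 1170000000},  # 2007
    {"id": 3, "type": "comment", "time": 1175000000},  # 2007
    {"id": 4, "type": "story", "time": 1500000000},    # 2017, outside the range
]
yearly_counts = entries_per_year(sample_items)
```

Using UTC here is a choice, not a given. Picking another timezone shifts entries made around New Year into a neighbouring bucket, which is one reason the totals above are only stated as "-ish".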

Let’s narrow the data and only look at stories:

HN Stories by year between 2006 and 2016.

A caveat here: to my current knowledge, we are not filtering out “dead” stories - ones which were effectively blocked and not visible to most users.

We see that there were far fewer stories than entries overall. The pattern here is similar to the overall activity above. However, the jump in 2011 and 2012 is interesting. It could suggest that HN had some spam issues in those years. It might be interesting to look into.

One more way to look at the data! How many non-comment, non-story submissions were there?

HN neither-comments-nor-stories by year between 2006 and 2016.

According to the docs (and data), the type of any entry is one of 'comment', 'job', 'poll', 'pollopt', 'story'. So we’re looking at jobs and poll-stuff here. As pollopts are included, the numbers are probably inflated compared to what they would be with pollopts filtered out.
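To get a feeling for how much the pollopts inflate the numbers, the leftover entries can be split by type. A minimal sketch, assuming each entry is a dict with a `type` field holding one of the five values listed above. The helper name and the sample are hypothetical.

```python
from collections import Counter

MAIN_TYPES = {"comment", "story"}

def other_entries_by_type(items):
    """Count entries which are neither comments nor stories, grouped by type."""
    return Counter(
        item["type"] for item in items if item["type"] not in MAIN_TYPES
    )

# Made-up sample: one poll with three options, plus a job posting.
sample_items = [
    {"type": "story"},
    {"type": "comment"},
    {"type": "job"},
    {"type": "poll"},
    {"type": "pollopt"},
    {"type": "pollopt"},
    {"type": "pollopt"},
]
other_counts = other_entries_by_type(sample_items)
total_other = sum(other_counts.values())
# The same total with poll options left out.
without_pollopts = total_other - other_counts["pollopt"]
```

In this toy sample, a single poll with three options accounts for four of the five "other" entries.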

Anyway, time for some sorted lists!

Stories With Most Comments

What are the stories which have received the most comments in total? Here’s the top 5 for the selected time range, prefixed with the total number of comments:

Stories With Most Upvotes

Let’s get sidetracked for a minute! There’s one interesting thing to be said about the score. Comments are stored in a list and, apart from deleted ones, you pretty much know when each of them was made. The number of upvotes/score/karma of a post, however, is only available as a “point in time” value. When looking at these old entries, we can be fairly sure that their score hasn’t changed for a while. However, we don’t know when the upvoting happened, nor whether the value might still change.

Also, when comparing the scores of submissions made earlier in the history of HN, keep in mind that there just weren’t that many upvotes to go around back then. I remember reading an article which compared that “devaluation” to inflation - a post with 100 upvotes in the first year was way more popular than a post with 100 upvotes 10 years later. Come to think of it, the same probably applies to comment counts.

With all of that in mind - there’s room for interesting things to be discovered around adjusting the score/number of comments according to the number of visibly-active users on the site at the time.
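One way such an adjustment could look is sketched below. It is just one possible definition, not something computed for this post. "Visibly-active users" is approximated as the number of distinct authors who posted anything in the same year, and the score is divided by that number. The field names `by`, `time` and `score` are assumed to follow the public HN API item format; the function names and the sample are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone

def year_of(item):
    """Calendar year (UTC) of an entry, assuming a Unix timestamp in "time"."""
    return datetime.fromtimestamp(item["time"], tz=timezone.utc).year

def active_authors_per_year(items):
    """Number of distinct authors who posted at least once in each year."""
    authors = defaultdict(set)
    for item in items:
        if item.get("by"):  # entries without an author are skipped
            authors[year_of(item)].add(item["by"])
    return {year: len(names) for year, names in authors.items()}

def adjusted_score(story, active):
    """Score per visibly-active author in the story's year."""
    return story["score"] / active[year_of(story)]

# Made-up sample: 2 active authors in 2007, 4 active authors in 2016.
early_story = {"type": "story", "by": "a", "time": 1170000000, "score": 100}
late_story = {"type": "story", "by": "c", "time": 1460000000, "score": 100}
sample_items = [
    early_story,
    {"type": "comment", "by": "b", "time": 1170000500},
    late_story,
    {"type": "comment", "by": "d", "time": 1460000500},
    {"type": "comment", "by": "e", "time": 1460001000},
    {"type": "comment", "by": "f", "time": 1460001500},
]
active = active_authors_per_year(sample_items)
early_adjusted = adjusted_score(early_story, active)
late_adjusted = adjusted_score(late_story, active)
```

With this definition, the two stories with identical raw scores end up with different adjusted values, the early one ranking higher. Lurkers who only vote are invisible in the data, so this can only ever be a rough proxy.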

We could also use comment timestamps to guesstimate how long there has been activity around a submission, but all of that is something for another time.

Most Nested Comment Threads

Which stories have spawned the most deeply nested comment threads? Here are three stories and the corresponding thread-tails:

Usually, clicking “context” on the last comment gives you a view of the whole conversation. That doesn’t seem to work for me in these cases.
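Finding a thread-tail boils down to following parent links upwards and counting the hops. Here is a minimal sketch, assuming every comment has a `parent` field pointing at the id of the entry it replies to (as in the public HN API item format) and that all entries fit into a dict keyed by id. The function names and the sample thread are made up.

```python
def thread_depth(item_id, items_by_id):
    """Return (depth, root_id) by following parent links upwards.

    Depth 0 means the entry has no known parent (typically a story).
    """
    depth = 0
    current = items_by_id[item_id]
    # Stop when there is no parent, or the parent is missing from the data.
    while current.get("parent") in items_by_id:
        current = items_by_id[current["parent"]]
        depth += 1
    return depth, current["id"]

def deepest_comment(items_by_id):
    """Return (comment_id, depth, root_id) for the most nested comment."""
    comments = (i for i in items_by_id.values() if i["type"] == "comment")
    best = max(comments, key=lambda i: thread_depth(i["id"], items_by_id)[0])
    depth, root_id = thread_depth(best["id"], items_by_id)
    return best["id"], depth, root_id

# Made-up thread: story 1, a reply chain 2 -> 3 -> 4, and a top-level comment 5.
sample = {
    1: {"id": 1, "type": "story"},
    2: {"id": 2, "type": "comment", "parent": 1},
    3: {"id": 3, "type": "comment", "parent": 2},
    4: {"id": 4, "type": "comment", "parent": 3},
    5: {"id": 5, "type": "comment", "parent": 1},
}
tail = deepest_comment(sample)
```

The walk is iterative rather than recursive on purpose, so that very long reply chains can't run into Python's recursion limit. For millions of entries, caching the depth of already-visited comments would avoid walking the same chains over and over.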

Longest Single Comments

Two of those are copy-pastes of some text which wasn’t available at its original source anymore. The one by pg lists all the banned sites in 2009.

Users With Most Entries Authored

Formatted in an eye-pleasing (citation needed) code block for readability:

 1. tptacek      - 40183
 2. jacquesm     - 27374
 3. DanBC        - 17223
 4. dragonwriter - 16060
 5. dang         - 14476
 6. jrockway     - 14169
 7. mikeash      - 13897
 8. anigbrowl    - 13847
 9. rayiner      - 13540
10. icebraining  - 12870
11. rbanffy      - 12860
12. davidw       - 12838
13. sp332        - 12792
14. eru          - 12745
15. brudgers     - 12205
16. pjmlp        - 12038
17. ChuckMcM     - 11942
18. coldtea      - 11913
19. stcredzero   - 11815
20. yummyfajitas - 11286
21. TeMPOraL     - 11096
22. tokenadult   - 10928
23. pg           - 10730
24. protomyth    - 10565
25. sliverstorm  - 10561
26. gaius        - 10167
27. Tichy        - 10066
28. JoeAltmaier  - 10003
29. ColinWright  -  9755
30. Retric       -  9744
31. patio11      -  9494

Looks like a long-tail distribution. As it should be. The length of the list is directly influenced by me being a patio11 fan and wanting to see where he ended up on the list. This will be relevant to know for future inquiries.
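A ranking like the one above is a straightforward count per author. A minimal sketch, assuming the author of an entry is stored in a `by` field (as in the public HN API item format) and that entries without one, such as deleted ones, should simply be skipped. The function name and the sample are made up.

```python
from collections import Counter

def top_authors(items, n=3):
    """Return the n authors with the most entries as (name, count) pairs."""
    counts = Counter(item["by"] for item in items if item.get("by"))
    return counts.most_common(n)

# Made-up sample, including one entry without an author.
sample_items = (
    [{"by": "alice"}] * 3
    + [{"by": "bob"}] * 2
    + [{"by": "carol"}]
    + [{"id": 7, "deleted": True}]
)
ranking = top_authors(sample_items, n=2)

# Print the ranking in the same style as the list above.
for position, (name, count) in enumerate(ranking, start=1):
    print(f"{position:2d}. {name:<12} - {count:5d}")
```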

It’s A Wrap!

At least for now. This first look at the data has already yielded a few possibly-interesting questions to follow up on, and a few facts about the dataset I didn’t know.

I hope it was entertaining to you! If you want, you can join the discussion on HN.

For the next article, I’d like to take a closer look at how some of the more prominent HN users have been interacting with the site over the years. Until then!