vsupalov

Frequent Suffixes And Prefixes In HN Titles

I’ve recently started to play around with HN data. A first look at the first 10 years was interesting enough. But now that the dataset is almost complete, I wanted to dive a bit deeper.

What are frequent suffixes and prefixes of HN submission titles?

Note: the analyzed dataset isn’t quite complete, but large enough to be representative at this point. We have 3 million submitted posts to work on.

Prefixes Containing “HN” And A Colon

Here are the 20 most frequent title prefixes that start or end with “HN” and are followed by a colon:

"Ask HN" - 128526
"Show HN" - 98613
"Tell HN" - 2402
"Launch HN" - 622
"Offer HN" - 303
"Apply HN" - 297
"ASK HN" - 288
"Remind HN" - 133
"Ash HN" - 74
"Suggest HN" - 62
"SHOW HN" - 54
" Ask HN" - 47
"Shown HN" - 45
"Request HN" - 42
"Warn HN" - 35
"Help HN" - 33
"Give HN" - 32
"Share HN" - 30
"ASk HN" - 30
"Thank HN" - 28

“Ask HN” alone accounts for about 128k out of 3 million titles, roughly 4%. Huh. Not a lot, but still interesting.

I didn’t expect that there would be no “HN X:” style titles in there!

After “Apply HN” the realm of infrequent prefixes and typos starts. Looking at you, “Ash HN”.
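
For illustration, here is a minimal sketch of how such a count could be produced. The regex, the function name `count_hn_prefixes` and the sample titles are assumptions made for this example, not the exact code behind the numbers above. Leading whitespace and casing are deliberately left untouched, which is why variants like " Ask HN" or "ASK HN" would show up as separate entries.

```python
import re
from collections import Counter

# Assumption: a title counts when it starts with "<word> HN:" or "HN <word>:".
# Leading whitespace is kept on purpose, so " Ask HN" stays a separate entry.
HN_PREFIX = re.compile(r"^(\s*(?:\w+ HN|HN \w+)):")


def count_hn_prefixes(titles):
    """Count colon-terminated prefixes that start or end with 'HN'."""
    counts = Counter()
    for title in titles:
        match = HN_PREFIX.match(title)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample titles, not taken from the dataset.
sample = [
    "Ask HN: What are you reading?",
    "Ask HN: Favourite text editor?",
    "Show HN: My weekend project",
    "Ash HN: A typo in the wild",
    " Ask HN: Leading space",
    "A title without any prefix",
]

for prefix, n in count_hn_prefixes(sample).most_common(20):
    print(f'"{prefix}" - {n}')
```

Stripping and lowercasing the prefix before counting would merge the typo-free variants, at the cost of hiding entries like "ASk HN".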

Square Brackets

Having a prefix or suffix of the form [...] was something I kind of remembered seeing from time to time. The results surprised me.

First 10 most-frequent prefixes:

"video" - 626
"pdf" - 129
"Show HN" - 108
"Ask HN" - 81
"Infographic" - 45
"Podcast" - 40
"ANN" - 40
"Video" - 27
"Tutorial" - 24
"C++" - 19

First 10 most-frequent suffixes:

"pdf" - 27029
"video" - 15025
"audio" - 846
"Infographic" - 288
"Video" - 138
"YouTube" - 130
"infographic" - 120
"slides" - 119
"2009" - 111
"2010" - 104

Prefixes in square brackets are really infrequent, compared to suffixes!

As for the suffixes, it seems like people are linking to PDFs and videos a bunch: about 42k titles end with one of those two tags.
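
A rough sketch of how the square-bracket counts could be collected is shown below. It assumes that a prefix is a `[...]` group at the very start of a title and a suffix is one at the very end; the patterns, the helper name `count_bracket_tags` and the sample titles are made up for this example.

```python
import re
from collections import Counter

# Assumption: a "prefix" is a [...] group at the very start of the title,
# a "suffix" is a [...] group at the very end (trailing whitespace ignored).
BRACKET_PREFIX = re.compile(r"^\[([^\[\]]*)\]")
BRACKET_SUFFIX = re.compile(r"\[([^\[\]]*)\]\s*$")


def count_bracket_tags(titles):
    """Return (prefix_counts, suffix_counts) for square-bracket tags."""
    prefixes, suffixes = Counter(), Counter()
    for title in titles:
        start = BRACKET_PREFIX.search(title)
        end = BRACKET_SUFFIX.search(title)
        if start:
            prefixes[start.group(1)] += 1
        if end:
            suffixes[end.group(1)] += 1
    return prefixes, suffixes


# Made-up sample titles, not taken from the dataset.
sample = [
    "A paper on garbage collection [pdf]",
    "Conference talk on compilers [video]",
    "[video] A short demo",
    "Another paper [pdf]",
    "No tag here",
]

prefixes, suffixes = count_bracket_tags(sample)
print("prefixes:", prefixes.most_common(10))
print("suffixes:", suffixes.most_common(10))
```

Note that a title consisting of nothing but a single tag would be counted as both a prefix and a suffix in this sketch.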

Round Brackets

Anything starting or ending with (...).

First 10 most-frequent prefixes:

"Almost" - 20
"2014" - 19
"Un" - 19
"Video" - 17
"2013" - 17
"2009" - 16
"Re" - 16
"Not" - 15
"2012" - 14
"Ab" - 13

First 30 most-frequent suffixes:

"2014" - 5369
"2013" - 4907
"2015" - 4905
"2016" - 4902
"2017" - 4835
"2018" - 4581
"2012" - 3961
"2011" - 3296
"2019" - 3092
"2010" - 2612
"2009" - 2240
"2008" - 1882
"2007" - 1710
"2020" - 1350
"video" - 1336
"2006" - 1313
"2005" - 1127
"2004" - 918
"2003" - 797
"1999" - 727
"2002" - 707
"2001" - 677
"2000" - 658
"1996" - 549
"1998" - 536
"1997" - 533
"" - 494
"Part 1" - 477
"1995" - 387
"Video" - 371

A suffix in round brackets is used pretty consistently to specify the year. The data should have some gaps in 2021, so this might skew these results a bit.
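
To put a number on "pretty consistently", one could measure which share of round-bracket suffixes looks like a year. The sketch below assumes a year is exactly four digits starting with 19 or 20; the function names and sample titles are hypothetical. It also shows that an empty `()` at the end of a title yields an empty string, which is one possible source of an entry like `""`.

```python
import re
from collections import Counter

# Assumption: the suffix is a (...) group at the very end of the title.
PAREN_SUFFIX = re.compile(r"\(([^()]*)\)\s*$")
# Assumption: a "year" is exactly four digits starting with 19 or 20.
YEAR = re.compile(r"^(?:19|20)\d{2}$")


def paren_suffix_counts(titles):
    """Count the text inside a trailing (...) group."""
    counts = Counter()
    for title in titles:
        match = PAREN_SUFFIX.search(title)
        if match:
            counts[match.group(1)] += 1
    return counts


def year_share(counts):
    """Fraction of counted suffixes that look like a year."""
    total = sum(counts.values())
    years = sum(n for suffix, n in counts.items() if YEAR.match(suffix))
    return years / total if total else 0.0


# Made-up sample titles, not taken from the dataset.
sample = [
    "An essay on programming languages (2014)",
    "A classic paper on distributed systems (1999)",
    "Building a compiler (Part 1)",
    "A recorded talk (video)",
    "Empty parentheses ()",
    "No suffix at all",
]

counts = paren_suffix_counts(sample)
print(counts.most_common(30))
print(f"share of year suffixes: {year_share(counts):.2f}")
```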

2014 was interesting! At least for post-worthy content.

Note: I wonder how submissions tagged with the year are performing score-wise. Would 2014 still be ahead if we weighted the success with the number of upvotes? This is something for another time.
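
As a starting point for that follow-up, the weighting could look roughly like this. The sketch assumes each post is available as a `(title, score)` pair; that shape, the function name `score_by_year` and the sample values are all hypothetical and say nothing about the real data.

```python
import re
from collections import defaultdict

# Assumption: the year is a trailing "(YYYY)" group starting with 19 or 20.
YEAR_SUFFIX = re.compile(r"\(((?:19|20)\d{2})\)\s*$")


def score_by_year(posts):
    """Map each year suffix to (post count, total score, mean score).

    `posts` is an iterable of (title, score) pairs.
    """
    totals = defaultdict(lambda: [0, 0])
    for title, score in posts:
        match = YEAR_SUFFIX.search(title)
        if match:
            entry = totals[match.group(1)]
            entry[0] += 1
            entry[1] += score
    return {year: (n, total, total / n) for year, (n, total) in totals.items()}


# Made-up titles and scores, purely for illustration.
sample_posts = [
    ("Essay A (2014)", 10),
    ("Essay B (2014)", 30),
    ("Paper C (1999)", 100),
    ("No year here", 5),
]

for year, (n, total, mean) in sorted(score_by_year(sample_posts).items()):
    print(f"{year}: {n} posts, total score {total}, mean {mean:.1f}")
```

Comparing the mean score rather than the total keeps years with many low-scoring submissions from dominating the ranking.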

Text Followed By A Colon

What if we only look for a colon, and don’t require “HN” to be in the text?

"Ask HN" - 127401
"Show HN" - 95906
"Tell HN" - 2350
"Ask YC" - 2299
"Study" - 885
"Video" - 832
"Launch HN" - 615
"Report" - 588
"Google" - 581
"Ask PG" - 550
"Review" - 533
"Coronavirus" - 519
"Book Review" - 394
"Tutorial" - 392
"Microsoft" - 372
"Infographic" - 348
"Facebook" - 343
"Apple" - 332
"Interview" - 316
"Elon Musk" - 306

Well, “Elon Musk” (306) is just ahead of “Offer HN” (303) from above. Also, it seems it’s possible to ask PG or YC instead of HN.

Otherwise, this style is used to tag the content, specify which entity is concerned and what format to expect.
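
Dropping the “HN” requirement could be sketched as follows. The assumptions here are mine: the prefix is everything before the first colon, the colon has to be followed by whitespace (so times like 10:30 are skipped), and only prefixes of up to three words count as tags. The `MAX_WORDS` cutoff and the sample titles are hypothetical.

```python
import re
from collections import Counter

# Assumption: the prefix is everything before the first colon, and the
# colon must be followed by whitespace.
COLON_PREFIX = re.compile(r"^([^:]+):\s")
MAX_WORDS = 3  # hypothetical cutoff for what still counts as a tag


def count_colon_prefixes(titles):
    """Count short colon-terminated prefixes at the start of a title."""
    counts = Counter()
    for title in titles:
        match = COLON_PREFIX.match(title)
        if match and len(match.group(1).split()) <= MAX_WORDS:
            counts[match.group(1)] += 1
    return counts


# Made-up sample titles, not taken from the dataset.
sample = [
    "Ask HN: How do you take notes?",
    "Study: Coffee drinkers write more code",
    "Study: Sleep matters",
    "Book Review: A made-up book",
    "Meeting at 10:30 tomorrow",
    "A long sentence that happens to contain a colon: like this",
]

for prefix, n in count_colon_prefixes(sample).most_common(20):
    print(f'"{prefix}" - {n}')
```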

In Closing

That was an interesting way to look at the data!

I’m curious to apply more elaborate methods to find frequent post styles and patterns that this method didn’t surface. Something that would help spot the “who is hiring” type of recurring posts.

There could be interesting hints in this simple analysis which I’m not yet seeing - I suspect they might reveal themselves as follow-up questions with time.

If you want, you can join the conversation about this post on HN. Until the next one!