vsupalov

Generating HN Titles Using Markov Chains In Python

A simple way to build a machine learning model that can generate text is to use a Markov chain.

I had to adjust my workflows at this point, because just loading the whole dataset into memory started causing performance issues… So filtering submissions while loading them from disk was the way to go.

With all HN submission titles at hand, it was pretty quick to pick out the successful submissions (ones which received at least a few comments or upvotes) and feed them to a ready-made Markov chain library: markovify.
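A minimal sketch of what filtering while loading could look like. The file layout is an assumption: it supposes one JSON object per line with `title`, `score` and `num_comments` fields. The thresholds and the `iter_successful_titles` name are made up for illustration and are not taken from the actual dataset.

```python
import json


def iter_successful_titles(lines, min_score=5, min_comments=3):
    """Yield titles of submissions that clear either threshold.

    `lines` is any iterable of JSON strings, e.g. an open file object,
    so only one record needs to be held in memory at a time.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        title = record.get("title")
        if not title:
            continue  # skip submissions without a title
        score = record.get("score") or 0
        comments = record.get("num_comments") or 0
        if score >= min_score or comments >= min_comments:
            yield title


# Hypothetical usage; "submissions.jsonl" is a placeholder path:
# with open("submissions.jsonl", encoding="utf-8") as f:
#     all_titles = list(iter_successful_titles(f))
```

Because the function is a generator, the rejected submissions never pile up in memory. Only the titles that pass the filter end up in the final list.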

Here’s what the juicy parts of the code look like:

import markovify

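# all_titles is just a list of strings; NewlineText treats every
# line of its input as one sentence, so join the titles with newlines
text_model = markovify.NewlineText("\n".join(all_titles))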

for i in range(5):
    print(text_model.make_sentence())

Awesome, right? It’s so easy!

For the sake of amusement, here is an output sample:

Google, Facebook legislation to legalize marijuana
Explicit Trusted Proxy in Go
Can we sell a tiny subset of Python code, won't make upgrade deadline
Explaining to my well-being
Twitter Is About to Change Someone's Mind

Some of those sound HNy, some like satire, others pretty far-fetched. That’s because a Markov chain does not care whether any of this makes sense! It only cares about probabilities: how likely a word is to follow the words right before it.
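To make the "only probabilities" part concrete, here is a toy word-level chain in plain Python. This is a sketch of the general idea, not how markovify is implemented: it looks at a single previous word, whereas markovify by default uses the two previous words as its state. The `build_chain` and `generate` names and the `<s>` / `</s>` markers are made up for this example.

```python
import random
from collections import defaultdict


def build_chain(titles):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for title in titles:
        # Markers for the start and end of a title.
        words = ["<s>"] + title.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng=random):
    """Walk the chain from the start marker until the end marker."""
    word, out = "<s>", []
    while True:
        # Repeated entries in the list make frequent successors
        # proportionally more likely to be picked.
        word = rng.choice(chain[word])
        if word == "</s>":
            return " ".join(out)
        out.append(word)


titles = [
    "Show HN: A tiny Python web server",
    "Ask HN: A tiny question about Go",
]
print(generate(build_chain(titles)))
```

Both titles share "HN: A tiny", so the walk can start in one title and finish in the other. Nothing checks whether the result means anything.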

Another way to generate output is to use make_short_sentence. Here’s the output limited to 280 characters:

IBM halting sales of alcohol
Lessons learned from my Senator regarding SOPA
JavaScript as a developer?

Solid.
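For reference, a sketch of how that call could be wrapped. markovify's `make_short_sentence` takes the maximum number of characters as its first argument and, like `make_sentence`, returns `None` when it fails to come up with a result. The `short_titles` helper and its `max_tries` cap are hypothetical, added here only for illustration.

```python
def short_titles(text_model, count=3, max_chars=280, max_tries=100):
    """Collect up to `count` generated titles of at most `max_chars` characters."""
    titles = []
    for _ in range(max_tries):
        if len(titles) >= count:
            break
        title = text_model.make_short_sentence(max_chars)
        if title is not None:  # None means the model gave up on this try
            titles.append(title)
    return titles


# With the text_model built earlier:
# for title in short_titles(text_model):
#     print(title)
```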

If you’re interested in one particular word, there’s a way to brute-force-filter some examples like this:

must_contain = "Bitcoin"
desired_number = 5
results = []

while len(results) < desired_number:
    s = text_model.make_sentence()
    # make_sentence() returns None when it fails to build a sentence
    if s is not None and must_contain in s:
        results.append(s)

for i in results:
    print(i)

It terminates eventually, as long as the word actually shows up in the training titles. The results in this case were a bit underwhelming:

Show HN: Days Away From Bitcoin. It's A Mirage
How Microsoft made it an app that makes Bitcoin stronger
Investors Bet Big on Bitcoin
Bitcoin Core switching from the browser
Bitcoin at an earlier decision by someone who wants it?
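If the word is rare, or never occurs at all, a loop like that can spin for a very long time. A possible variation is to cap the number of attempts. The `collect_matching` helper and its `max_attempts` default are assumptions made for this sketch. It takes any zero-argument callable, so `text_model.make_sentence` can be passed in directly.

```python
def collect_matching(make_sentence, must_contain, desired_number=5,
                     max_attempts=10_000):
    """Call `make_sentence` until enough results contain `must_contain`.

    Stops after `max_attempts` calls, so a word the model never produces
    cannot cause an endless loop. May return fewer results than requested.
    """
    results = []
    for _ in range(max_attempts):
        if len(results) >= desired_number:
            break
        sentence = make_sentence()
        # markovify returns None when it fails to build a sentence.
        if sentence is not None and must_contain in sentence:
            results.append(sentence)
    return results


# With the text_model built earlier:
# for title in collect_matching(text_model.make_sentence, "Bitcoin"):
#     print(title)
```

The trade-off is that the caller has to cope with a shorter list than requested, which seems better than a script that never returns.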

How about looking for sentences generated with “Ask HN:”?

Ask HN: What is the skillset required to sign executive order banning transactions with the iPod?
Ask HN: Took a Pay Cut to Infrastructure-as-Declarative-Code?
Ask HN: How to Survive Downturn
Ask HN: Sys Admin Career Advice – WhoIsHiring – According to a mass casualty incident
Ask HN: Do you get Android wrong in the Ruby gem

Utter nonsense. But great fun! You can read more about my tinkering with HN data over here.