vsupalov

Notes On Building A 'HN Data Project'

I’ve had a few sessions of looking around HN data: gathering recent comments and stories to be processed, and building a web app around them.

It seems to be a fun data playground / learning project, so I’d like to take it seriously.

The following are my working notes. The current idea is to have a mix of work-log-like entries and notes around topics, and to refine them over time. Sometimes sections will be left out as “to be filled in”, sometimes notes will become articles. Questions will be answered, or dropped. Problems will be solved… or dropped.

Let’s see how this goes!

2022-04-21

Rough Vision

  • 4 parts:
  • complete historic HN post data for tinkering
  • analytics lab, using that historic data
  • collect recent posts in a streaming-like fashion (make them available to lab as well?)
  • a way to set alerts, to be notified about incoming data

Data should be collected in as-raw form as possible.

Basically, a playground to build stuff around HN data, and to have a working useful toy product.

Starting Out

I’ve dug out some Python code, which was querying the most recent HN entry id (it counts up, starting at 1), and fetching posts since the last highest id. Basically, a makeshift update stream from HN.

This old code had some issues! Also, it wasn’t made to fetch all-entries-ever. Which I’d like to do.

So I cannibalized parts of the old code, and looked into how hard it would be to create a script which fetches ranges of entries and saves them to disk in JSON format. It’s not the most elegant, but in theory fetching historical data should only happen once. So it’s okay if it’s not super stable or anything like that.
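Here’s a minimal sketch of how such a range-fetching script could look. It assumes the public HN API item endpoint; fetch_item and fetch_range, as well as the output file naming, are made up for illustration and are not the original script.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: the public HN API endpoint for a single item.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one entry from the HN API and return the decoded JSON."""
    with urllib.request.urlopen(ITEM_URL.format(item_id), timeout=30) as response:
        return json.load(response)


def fetch_range(first_id, last_id, out_dir, fetch=fetch_item):
    """Fetch entries first_id..last_id (inclusive) and save them as one JSON file."""
    entries = {}
    for item_id in range(first_id, last_id + 1):
        entry = fetch(item_id)
        if entry is not None:  # skip ids for which nothing came back
            entries[item_id] = entry
    out_path = Path(out_dir) / f"items_{first_id}_{last_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Note: JSON object keys are strings, so the integer ids become "1", "2", ...
    out_path.write_text(json.dumps(entries))
    return out_path
```

Passing the fetch function in as a parameter keeps the loop testable without touching the network, and one file per id range makes it easy to see which ranges are already on disk.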

How To Python Imports Again?

Python 3.3 did away with the need for __init__.py files. Thank goodness. Now it’s easier to import stuff from subfolders, as long as you don’t pick folder names which clash with existing Python modules (code, for example). See this on importing Python code.

This was a good opportunity to look up how to “re-import” changes to local files in Jupyter notebooks. This has been a nuisance for me for far too long. Solution found here.

import hncode.main as cm
from importlib import reload
reload(cm)

Note: unfortunately, this only holds up for simple single-file import cases. See below for more.

Stop Crashing!

I could build retries into the code, or accept that some requests will be borked. The latter results in “holes” in the data, which can easily be backfilled, however.

Right now, the script simply stops working whenever a request fails. I’ll do some error handling, to make sure that it logs them but continues happily.
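A minimal sketch of the log-but-continue idea, using only the standard library. The names fetch_all and fetch are hypothetical; fetch stands for whatever function retrieves a single entry.

```python
import logging

logger = logging.getLogger("hn_fetch")


def fetch_all(item_ids, fetch):
    """Fetch every id, logging failures instead of crashing.

    Returns the fetched entries plus the ids that failed, for later backfilling.
    """
    entries, failed_ids = {}, []
    for item_id in item_ids:
        try:
            entries[item_id] = fetch(item_id)
        except Exception:
            # logger.exception records the message together with the traceback
            logger.exception("fetching item %s failed", item_id)
            failed_ids.append(item_id)
    return entries, failed_ids
```

Keeping the list of failed ids around is what makes the “holes” easy to backfill in a second pass.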

‘Deploying’ Code, Fetching Data, Keeping Track

How to sync local code with a remote machine? I don’t want to set up a Git(Hub) repository for this. Something like rsync will do just fine? But I’d have to exclude the data folder.

I like the following split between data and code:

remote_project_folder/
  code/
  data/
  NOTES

Code gets uploaded from your local code folder to the remote code folder. Data gets synced in the other direction, from the remote data folder to your local data folder.

The NOTES file is a central “thought dump” space for me within this project.

That’s the moment when I notice that my choice to run the code on a remote machine hasn’t been explained yet. It’s easier to use spare bandwidth on an always-on computer than precious home wifi bandwidth. No noise, no disruptions.

Running Scripts Remotely

I love tmux. You ssh into the machine, open up tmux (or attach to an existing session), start the commands you need, and disconnect again.

Sure, it’s not resistant to reboots of the remote machine, but at this point it’s pretty much okay this way.

Retries ftw

Just hit Google with “requests handle connection reset by peer”, and found a way to add retry logic to the code. This should help to smooth out occasional failures, making the script more resilient and keeping the data quality high at the same time.

The tenacity module exists, and seems pretty cool. If you specify how long to wait, and when to stop that is. Otherwise you’ll have endless retries - no good.
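To make the “how long to wait, and when to stop” point concrete, here’s a small standard-library stand-in. This is not tenacity’s API, just a hand-rolled decorator with the same two knobs; the names retry, max_attempts and wait_seconds are made up for this sketch.

```python
import functools
import time


def retry(max_attempts=3, wait_seconds=1.0, exceptions=(Exception,)):
    """Retry the decorated function, giving up after max_attempts tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the last error surface
                    time.sleep(wait_seconds)  # wait before the next try
        return wrapper
    return decorator
```

Because max_attempts is bounded, a permanently failing request eventually raises instead of retrying forever.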

Rsync Basics

To copy the content of a folder including subfolders, -r is an important flag. The “source” folder needs to have a trailing slash, otherwise the directory itself gets copied into the target directory, which is probably not what you want.

rsync -r remote:data/ ../data

Waiting For A Remote File

Annoying, I know! Sometimes, you need to wait for a remote file to appear. Did I say sometimes? I guess it’s more like ‘I never thought I would need this ever’. Still, you can do it with SSH. Source.

while true; do ssh -q $HOSTNAME "[[ -f $FILEPATH ]]" && echo "File exists" || echo "File does not exist"; sleep 10; done

Every 10 seconds, it will print whether the remote file exists or doesn’t. Of course in an endless loop.

This will not help check for errors in the running command though. Oh well, at least it will work in the happy case.

Tweaking VIM

I use vscode sometimes, but vim often as well. For Python, I’ll need to fix the following:

  • enable syntax highlighting
  • spaces instead of tabs

2022-04-22

Unified Data Flows?

Even though it’s attractive to come up with a unified data architecture - everything flowing together in a single setup, it’s not necessary at the moment.

Getting historical data is a single, longer-running process which might be rerun eventually to catch up.

The incremental update gathering is only relevant for alerting reasons - it doesn’t need to lead to direct incorporation into the historical dataset. At least for now.

Eventually, it will be enough for a single regular cron job to get new data entries, inspect them for alert-worthy content, and create alerts. The historical dataset does not need frequent updates.

However, having a way to incorporate data gathered through incremental updates on demand could be useful to prevent repeated fetching.

Allow me an ASCII-ish diagram:

historical data -> script -> raw historical data
cron -> incremental data fetching -> events in web application 
                                  -> "data from updates" storage

It could be more complex, but this will do. The thing which I’d like to have reusable is the code used to work on complete historical data and incoming data snippets. This could be a Python module, used for tinkering and in the web application later on.

Seeing What Causes Exceptions

Okay, here’s something I don’t know how to do well, or even how to look for a better pattern.

Say the code is fresh and !fun!, and you’re getting exceptions. It would be more useful to see the data causing the exception than to look at the stack trace alone. What to do?

Here’s an awkward gotta-catch-em-all error handling approach to put around the offending line:

import traceback

for v in values:  # the outer loop over your data
    try:
        pass  # imagine a line doing something with the value "v"
    except Exception:
        print(v)  # the data being handled when things went wrong
        traceback.print_exc()
        break  # stop at the first failure

Whenever something goes wrong, the except case triggers and prints the data which was being handled. Assuming the data is useful to print, of course. Other options.

Praise Jupyter

I haven’t mentioned this here before, but Jupyter is AMAZING. It is the best environment I know to investigate data, prototype Python code and follow curious ideas around data in general.

It’s the place where I learn to deal with datasets, prototype data gathering routines and write my POCs if tests or a piece of paper aren’t the better tool for the job.

Easy Python Sorting

Gone are the days where I had to write lambda functions to sort lists in Python. Now I have embraced the glory of:

from operator import itemgetter, attrgetter
sorted_result = sorted(list_of_tuples, key=itemgetter(0))
sorted_result = sorted(list_of_dicts, key=itemgetter("name"))
# and there's the attrgetter thing as well, with a string to look up stuff in objects

With an occasional reverse=True you’re good to go for all your sorting needs. Read more here.
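For completeness, a small sketch of attrgetter, reverse=True and a multi-key itemgetter. The Entry class and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from operator import attrgetter, itemgetter


@dataclass
class Entry:  # hypothetical stand-in for any object with attributes
    by: str
    score: int


entries = [Entry("alice", 3), Entry("bob", 10), Entry("carol", 7)]

# attrgetter looks up the named attribute on each object
by_score = sorted(entries, key=attrgetter("score"), reverse=True)
print([e.by for e in by_score])  # ['bob', 'carol', 'alice']

# itemgetter with several indexes sorts by the second item, then the first
rows = [("b", 2), ("a", 2), ("a", 1)]
sorted_rows = sorted(rows, key=itemgetter(1, 0))
print(sorted_rows)  # [('a', 1), ('a', 2), ('b', 2)]
```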

2022-04-24

Considering Querying Ergonomics

Right now, all the data is a single Python dictionary, containing id -> entry dict data. Filtering for a single author and sorting by length looks something like this:

filtered = []

for i in data.values():
    if i.get('by', "") == "patio11":
        filtered.append(i)

filtered.sort(key=lambda x: len(x.get("text", "")), reverse=True)

It’s working well enough, and one could transition towards helper functions. However, I would like another style better:

longest_comments = data.filter(by="pg").get_sorted_big_first(lambda x: len(x["text"]))

Having a wrapper around the raw data could make it easier to build reusable building blocks. For example:

  .no_dead()
  .no_deleted()
  .between_years(2006,2007)
  .filter(type="comment")
  .fill_in_gaps() # to add referenced data recursively, if needed for certain queries
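A wrapper along these lines could be sketched as follows. Everything here is hypothetical: the class name HNData and its internals are made up, and it assumes entries carry the HN API fields "dead", "deleted" and "time" (Unix seconds). fill_in_gaps() is left out.

```python
from datetime import datetime, timezone


class HNData:
    """Chainable wrapper around an id -> entry dict (hypothetical sketch)."""

    def __init__(self, entries):
        self.entries = entries

    def _keep(self, predicate):
        # Every filter returns a new wrapper, which is what allows chaining.
        kept = {k: v for k, v in self.entries.items() if predicate(v)}
        return HNData(kept)

    def filter(self, **fields):
        """Keep entries whose fields equal all given values, e.g. by="pg"."""
        return self._keep(lambda e: all(e.get(k) == v for k, v in fields.items()))

    def no_dead(self):
        return self._keep(lambda e: not e.get("dead", False))

    def no_deleted(self):
        return self._keep(lambda e: not e.get("deleted", False))

    def between_years(self, first, last):
        """Keep entries whose "time" falls into the years first..last, inclusive."""
        def in_range(e):
            if "time" not in e:
                return False
            year = datetime.fromtimestamp(e["time"], tz=timezone.utc).year
            return first <= year <= last
        return self._keep(in_range)

    def get_sorted_big_first(self, key):
        """Return a list of entries, largest key first."""
        return sorted(self.entries.values(), key=key, reverse=True)
```

With this, something like HNData(data).no_dead().filter(by="pg", type="comment").get_sorted_big_first(lambda x: len(x.get("text", ""))) reads close to the wished-for style, at the cost of copying the dict on every step.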

A few helper functions for printing data while tinkering, and types to prevent silly mistakes could be great.

Also, less frequent ways to look at the data, or some which don’t return a filtered-down raw-data dict, could be useful eventually. Sometimes you want to get a sorted list, sometimes a slice of data.

I’m pretty sure that some of those functions will make more advanced data operations easier. But which ones will be useful, and which ones are just distractions - only working with the data will tell. I just know that I’m quite fed up with writing the same for loops over and over.

Splitting Python Code

I’ve written about this before, at least I think I have. But in the context of Django. Surprise! It’s a neat trick in non-Django Python projects as well.

You can move code from a file into multiple ones. Here’s the pattern. You start with a somename.py file in its original location. You need to create a somename directory, move the somename.py file into it, and create an __init__.py in the new directory with the following content:

from .somename import *

That’s it. All your existing imports will still work, but you can split the code in the somename file into multiple ones, all imported in the __init__.py file. It’s a helpful initial step to gain a bit more overview in a way-too-long-file.

NOTE: this breaks the reload() style reloading from above…

Jupyter Reloading Magic

You can restart the Jupyter kernel every time code outside of the notebook changes. Or put these two lines at the top of your notebook instead:

%load_ext autoreload
%autoreload 2

Described here. This makes the reload instructions unnecessary - the “2” causes all imports to be reloaded between every code execution. Wasteful? Maybe slightly. But even recursive imports should be scooped up without causing confusing behaviour with this at the very top of the jupyter notebook.

2022-04-25

Plots!

I do love myself some pandas and matplotlib. It’s been a while, so a quick jupyter-conscious refresher has been great to stumble over!

Now I will be able to have arbitrary plots of data over time, with a reasonable amount of effort (almost none). At least once I figure out the usual building blocks and tricks again.

Oh no, is matplotlib where autoreload begins to act weird? Maybe tinkering with code and visualizing things should go into different notebooks - one less stable, the other more stable. For now at least?

Plots Indeed

Yep, those are some plots. Converting data between dicts and dataframes needs the right orientation, and viewing the contents of dataframes is helpful.

Being able to create new columns is neat. Grouping by year is super handy. Aggregating the grouped data works like a charm.

A few more “nice to haves” and those could be usable plots. The nice to haves are:

  • have a fixed number of years in the plot
  • remove legend from aggregation plots
  • make plotting reusable
  • maybe save plots to file?

Guided By Output

I think that at this point I should try and create something with the help of the tinkering setup built so far.

I’m thinking about an article, sharing some more-or-less interesting bits of data about HN and better-known users. Writing this will help me to see what might be missing, and make sure that it’s not just code for code’s sake, but that shareable artifacts are being created.

Now, what’s something which could be entertaining to read about?

Note: it will eventually be live over here.

Python String Formatting Tricks

This is something I often forget how to do. If you have a format-string, or an f-string, you can do something like this:

"{: <5}"

If it’s an f-string, there would be a variable left of the colon. The effect of the above is that the string is left-aligned and padded on the right with spaces to a length of 5. There is more info here.
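A few variations of the same format spec, as a quick reminder. The variable name and the square brackets are only there to make the padding visible.

```python
name = "ab"

left = f"[{name: <5}]"      # left-aligned, padded on the right with spaces
right = f"[{name: >5}]"     # right-aligned, padded on the left with spaces
centered = f"[{name:*^6}]"  # centered, using "*" as the fill character
old_style = "[{: <5}]".format(name)  # the same spec works with str.format

print(left)      # [ab   ]
print(right)     # [   ab]
print(centered)  # [**ab**]
```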

Long But Readable Numbers

A tiny trick in Python which I came to appreciate again recently. Instead of writing 123500 you can write 123_500. The underscores are only there for readability, and don’t change the number value.

Preventing Partially Written Data Messups

Have you ever overwritten perfectly-fine data partially by accident? That can happen if the writing script is killed in the process of writing. Nasty stuff!

One way to work around this, is to write to a temporary file, and move it after the writing is complete. This way, the new data is either completely written, or the old one is in place and intact.
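A minimal sketch of the write-then-move approach with the standard library. The function name write_json_atomically is made up; the temp file is created in the same directory as the target, so that the final move stays on one filesystem.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write data as JSON to path without ever leaving a half-written file there."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(data, tmp_file)
        # os.replace swaps the new file in; on POSIX this rename is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

If the script gets killed mid-write, only the temp file is affected, and the old data stays in place.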

Protip: Data-pipeline oriented workflow engines do stuff like this by default. Preventing you from running into walls you didn’t know existed.

2022-04-27

Regexes In Python

Here are a few things I’d like to remember about regular expressions in Python going forward:

import re
rgx = re.compile(REGEX_STRING)
matches = rgx.findall(TEXT)

  • Non-capturing group: (?:CONTENT)
  • Non-greedy matching: *?
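A tiny sketch of what both of these change in practice, with made-up sample strings. Note that findall returns the group contents as soon as the pattern contains a capturing group.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, up to the last </b>
greedy = re.findall(r"<b>(.*)</b>", text)
# Non-greedy: .*? stops at the first </b> it can
lazy = re.findall(r"<b>(.*?)</b>", text)

# Capturing group: findall returns what the group matched (its last repetition)
capturing = re.findall(r"(ab)+c", "ababc")
# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r"(?:ab)+c", "ababc")

print(greedy)         # ['one</b> and <b>two']
print(lazy)           # ['one', 'two']
print(capturing)      # ['ab']
print(non_capturing)  # ['ababc']
```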

That’s about it! At least regarding the stuff I always seem to forget.

2022-05-04

Generating HN Titles

That was fun! I really appreciate the Python data-tinkering ecosystem.

The writeup turned out to be a bunch of text, so I moved it over here. I wonder if other methods for generating those titles could turn out to be less nonsensical?