vsupalov

My Favourite Way To Start Data Tinkering Projects With Python

Every time I want to tinker with data, I notice myself going through the same incantations again and again.

Of course I don’t remember them every single time, and have to look stuff up. This article is where I want to be looking from now on :)

A Fresh Environment

I'm assuming that Python 3.* is installed on the system, that pip and pipenv are installed globally, and that all libraries needed to build Python packages are available.

Are they called packages? Or modules? Wheels? Not sure… You know, when you type pip install something and it complains about some header files missing. That stuff.

Now all that’s left to do is create a new directory and enter it:

$ mkdir NEW_PROJECT_NAME
$ cd !$

Note: !$ is a fancy way to say that the last argument of the previous command should be repeated. In this case, NEW_PROJECT_NAME.

Now we create a new pipenv environment using Python 3-something (newer pipenv releases may no longer accept --three; there, pipenv --python 3 does the same job):

$ pipenv --three

Note: I heard about people using poetry and being happy with it. I haven’t gotten around to considering it, because pipenv mostly works fine for me. The exception is dependency resolution, which sometimes hangs. But at this stage it hasn’t let me down + I remember the syntax :)

Environment created! Now it’s time to install the most essential tool.

Jupyter Or Death

$ pipenv install jupyter

Jupyter is absolutely non-negotiable. I think it started as IPython? In theory Jupyter can handle Julia, Python and R, but I only ever use it for Python. It’s great for tinkering with intermediate results and retrying small parts of working with your data, without having to re-download/re-compute big chunks of data every single time or write logic to work around it.

After an initial tinkering period, you can extract code into more stable and reusable scripts. But jupyter is the perfect initial environment to get an understanding of your data, what you want to do with it and for fast iteration times. The most important ingredient.

Pick Your Toppings

Now, the rest depends on what kind of data I’m working with. Packages to talk to a specific API might be added here, but the most common suspects are:

  • beautifulsoup4
  • requests
  • pandas
  • (tqdm)

Beautifulsoup4 is great for working with HTML. HTML is messy, and using re to parse it sometimes works, but it’s not something you want to rely on. Once you look up how the most essential commands work and get past the initial gotchas, it’s a very convenient library for handling HTML data.

I’ll be honest here, I stumble through the docs every single time, trying to find those few patterns I really need. So here’s an overview of the most essential parts.
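As a rough sketch, the handful of patterns that tend to come up look something like this. The HTML snippet, the class and id names and the variable names are all made up for illustration; real pages will need their own selectors.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, standing in for a downloaded page.
html = """
<html><body>
  <h1 id="title">Example page</h1>
  <ul class="links">
    <li><a href="/first">First</a></li>
    <li><a href="/second">Second</a></li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# Find a single element by tag name and attribute, then grab its text.
title = soup.find("h1", id="title").get_text(strip=True)

# CSS selectors via select(); attributes are read like dictionary keys.
hrefs = [a["href"] for a in soup.select("ul.links a")]

# find_all() returns every matching tag in document order.
labels = [a.get_text() for a in soup.find_all("a")]

print(title)
print(hrefs)
print(labels)
```

One starting gotcha: find() returns None when nothing matches, so chaining .get_text() onto it fails on pages that lack the element.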

Requests has for a long time been my tool of choice to make… well, HTTP(S) requests. I have a few alternatives on my radar (namely httpx and aiohttp) for making async requests without having to spin up a thread pool, but I haven’t felt a dire need to switch to them yet.

Here’s a good comparison.
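For the plain synchronous case, a minimal requests sketch could look like the following. The fetch helper, the example.com URL and the query parameter are hypothetical and only there for illustration; the request at the bottom is built but never sent, so the snippet runs without network access.

```python
import requests


def fetch(url, params=None, timeout=10):
    """GET a URL and return the response body as text (hypothetical helper)."""
    # A timeout keeps a notebook cell from hanging forever on a dead server.
    response = requests.get(url, params=params, timeout=timeout)
    # Turn 4xx/5xx status codes into exceptions instead of silently continuing.
    response.raise_for_status()
    return response.text


# Build, but do not send, a request to see which URL would be fetched.
prepared = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()
print(prepared.url)
```

In a notebook, storing the result of fetch(...) in a variable in one cell means later cells can keep poking at it without downloading anything again.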

Pandas. Dataframes in Python. This is a great tool to really investigate tabular data. The syntax takes some getting used to (if you’re not fluent with numpy or maybe some matlab or R?) but it gives you something like superpowers when it comes to calculating and plotting summary statistics. It helps you really understand the data you’re dealing with and crunch it in ways you hadn’t thought conveniently possible. It goes well with matplotlib!
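To give a feel for the syntax, here is a small sketch with made-up data. The column names and values are invented for illustration, and the plotting line is left as a comment because it assumes matplotlib is installed.

```python
import pandas as pd

# Made-up measurements, purely for illustration.
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Hamburg", "Hamburg", "Munich"],
    "temperature": [21.5, 19.0, 17.5, 18.5, 23.0],
})

# Count, mean, standard deviation, min/max and quartiles of numeric columns.
print(df.describe())

# Group rows by a column and aggregate each group.
mean_by_city = df.groupby("city")["temperature"].mean()
print(mean_by_city)

# Boolean indexing: keep only the rows that satisfy a condition.
warm = df[df["temperature"] > 19]
print(warm)

# With matplotlib installed, a quick bar chart is one call away:
# mean_by_city.plot(kind="bar")
```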

Honorary mention: tqdm. More of a toy really, but sometimes really handy. You can use it to display progress bars for those pesky longer-running data crunching tasks. A bit of visual feedback goes a long way to make the work a bit more enjoyable. So why not invest a tiny bit of effort to get it?
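The "tiny bit of effort" mostly amounts to wrapping an iterable. In this sketch, slow_square is a hypothetical stand-in for whatever slow step the loop performs.

```python
import time

from tqdm import tqdm


def slow_square(n):
    """Hypothetical stand-in for a slow data crunching step."""
    time.sleep(0.01)
    return n * n


results = []
# Wrapping the iterable in tqdm() is enough to get a progress bar;
# desc sets the label shown in front of it.
for n in tqdm(range(50), desc="Crunching"):
    results.append(slow_square(n))

print(len(results))
```

Inside Jupyter, importing from tqdm.auto instead picks a notebook-friendly progress bar when the notebook widgets are available.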