Ari Lamstein

censusdis v1.4.0 is now on PyPI


I recently contributed a new module to the censusdis package. This resulted in a new version of the package being pushed to PyPI. You can install it like this:

$ pip install censusdis -U

# Verify that the installed version is 1.4.0
$ pip freeze | grep censusdis 
censusdis==1.4.0 

The module I created is called multiyear. It is very similar to the utils module I created for my hometown_analysis project. This notebook demonstrates how to use the module. You can view the PR for the module here.

This PR caused me to grow as a Python programmer. Since many of my readers are looking to improve their technical skills, I thought I’d write down some of the lessons I learned.

Python Files: Modules vs. Packages

The vocabulary around files, modules and packages in Python is confusing. Working on this PR is when the terms finally clicked for me:

  • A module is just a normal file with Python code (really). I am not sure why Python invented a new word for this. My best guess is to acknowledge that you can selectively import symbols from a module. This is different from C++ (the first language I programmed in professionally), where #include <file> pulls in the entire contents of the target file.
  • A package is just a directory that contains Python modules. The best practice appears to be putting a file named __init__.py in the directory to denote that it’s a package, although the file can be empty and is not strictly necessary (link). This also seems like an odd design decision.
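To make the distinction concrete, here is a hypothetical layout (the names are illustrative, not from censusdis):

```python
# Hypothetical layout illustrating the module/package distinction:
#
#   mypkg/              <- package: a directory of modules
#       __init__.py     <- marks the directory as a package (may be empty)
#       stats.py        <- module: a plain file of Python code
#
# Contents of the hypothetical stats.py:

def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

# Unlike C++'s #include, which pulls in an entire file, Python lets
# you import a single symbol from the module:
#   from mypkg.stats import mean
```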

One nice thing about this system is that it allows a package to span multiple (sub)directories. In R, all the code for a package must be in a single directory. I always felt that this limited the complexity of packages in R. It’s nice that Python doesn’t have that limitation.

Dependency Management

Python programmers like to talk about “dependency management hell.” This project gave me my first taste of that.

The initial version of the multiyear module used plotly to make the output of graph_multiyear interactive. I used it to do exploratory data analysis in Jupyter notebooks. However, when I tried to share those notebooks via GitHub, the images didn’t render: apparently GitHub’s notebook viewer cannot render JavaScript. The solution I stumbled upon is described here and requires the kaleido package.

The issue? Apparently this solution works with kaleido v0.2.0, but not with the latest version of kaleido (link). So anyone who wants this functionality will need to install a specific older version of kaleido. In Python this is known as “pinning” a dependency.

Technically, I believe you can do this by modifying the project’s pyproject.toml file by hand. But in practice people use tools like uv or poetry to both manage this file and create a “lockfile” which records the exact versions of all the packages you’re using. In this project I got experience doing this with both uv (which I used for my hometown_analysis repo) and poetry (which censusdis uses).
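As a sketch of what a pin looks like in pyproject.toml (the section names depend on which tool manages the file; this is an assumption for illustration, not censusdis’s actual configuration):

```toml
# With poetry, dependencies live under its own table;
# "0.2.0" here is an exact-version constraint:
[tool.poetry.dependencies]
kaleido = "0.2.0"

# With a PEP 621 pyproject.toml (the style uv manages), the
# equivalent pin would be:
# [project]
# dependencies = ["kaleido==0.2.0"]
```

Either way, the tool’s lockfile (poetry.lock or uv.lock) then records the resolved version for reproducible installs.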

Linting

At my last job I advocated for having all the data scientists use a style guide. At that company we used R, and people were OK giving up some matters of personal taste in order to make collaboration easier. The process of enforcing adherence to a style guide (or of running automated checks on code to detect errors) is called “linting”, and it’s a step we did not take.

In my hometown_analysis repo I regularly used black for this. Black appears to be the most widely used code formatter in the Python world. This was my first time using it on a project, and I simply ran it myself prior to checking in code.

The censusdis repo takes this a step further:

  • In addition to running black, it recommends contributors also run flake8 and ruff on their code prior to making a PR. For better or worse, Python seems to have a lot of linting tools. There appears to be some overlap in what they do, and I can’t speak to the differences between them. One thing that surprised me is that at least one of them was particular about the format in which I wrote the documentation for my functions.
  • It automatically runs all of these tools on each PR using GitHub Actions (link). If any of the linters detects an issue, the PR fails the automated test suite.
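The docstring-format complaints likely come from a checker expecting a structured convention such as numpydoc. A hedged sketch with a hypothetical function (not from the actual module):

```python
def percent_change(old: float, new: float) -> float:
    """Return the percent change from ``old`` to ``new``.

    Parameters
    ----------
    old : float
        The starting value; must be nonzero.
    new : float
        The ending value.

    Returns
    -------
    float
        The change expressed as a percentage of ``old``.
    """
    return (new - old) / old * 100
```

Docstring linters check that sections like `Parameters` and `Returns` are present, correctly underlined, and match the signature.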

Automated Tests

Speaking of tests: I did not feel the need to write them for my utils module for the hometown_analysis project. But censusdis uses pytest and has 99% test coverage (link). So it seemed appropriate to add tests to the multiyear module.

Writing tests is something that I’ve done occasionally throughout my career. Pytest was covered in Matt Harrison’s Professional Python course, which I took last year, but I found that I had forgotten a lot of the material. So I did what most engineers would do: I looked at examples in the codebase and used an LLM to help me.
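A sketch of the style I ended up with (the helper and values are hypothetical, not the actual multiyear tests):

```python
# Hypothetical helper mirroring the kind of small function the module contains:
def graph_title(variable: str, years: list[int]) -> str:
    """Build a plot title such as 'Median Income (2010-2019)'."""
    return f"{variable} ({min(years)}-{max(years)})"

# pytest collects functions named test_* and reports any failed assertion:
def test_graph_title():
    assert graph_title("Median Income", [2010, 2015, 2019]) == "Median Income (2010-2019)"

def test_graph_title_single_year():
    assert graph_title("Population", [2020]) == "Population (2020-2020)"
```

Running `pytest` in the project root discovers and runs these automatically.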

Type Annotations

I have mixed feelings about Python’s use of Type Annotations.

I began my software engineering career using C++, which is a statically typed language. Every variable in a C++ program must have a type defined at compile time (i.e. before the program executes). Python does not have this requirement, which I initially found freeing. Type annotations, I find, remove a lot of this freedom and also make the code a bit harder to read.

That being said, the censusdis package uses them throughout the codebase, so I added them to my module.
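For illustration, annotations on a hypothetical helper (not the package’s actual API) look like this:

```python
from typing import Optional

# Hypothetical annotated helper in the style used throughout censusdis;
# every parameter and the return value declare a type.
def span_label(years: list[int], suffix: Optional[str] = None) -> str:
    """Return a label like '2010-2019', optionally with a suffix appended."""
    label = f"{min(years)}-{max(years)}"
    return f"{label} {suffix}" if suffix else label
```

The annotations don’t change runtime behavior; they exist for readers and for static checkers like mypy.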

In Professional Python I was taught to run mypy to type check my type annotations. While I believe that my code passed without error, I noticed that the project had a few errors that were not covered in my course. For example:


cli/cli.py:9: error: Skipping analyzing "geopandas": module is installed, but missing library stubs or py.typed marker

It appears that type annotations become more complex when your code uses types defined by third-party libraries (such as pandas and, in this case, GeoPandas). I researched these errors briefly and created a GitHub issue for them.
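One common workaround (an assumption about a possible fix, not what the project actually did) is either to install stub packages where they exist (e.g. pandas-stubs for pandas) or to tell mypy to skip the untyped library in pyproject.toml:

```toml
# Hypothetical mypy override: silence the missing-stubs error for geopandas.
[[tool.mypy.overrides]]
module = "geopandas.*"
ignore_missing_imports = true
```

The trade-off is that mypy then treats everything imported from that library as untyped.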

Code Review

A major source of learning comes when someone more experienced than you reviews your code. This was one of the main reasons I chose to do this project: Darren (the maintainer of censusdis) is much more experienced than me at building Python packages, and I was interested in his feedback on my module.

Interestingly, his initial feedback was that it would be better if the graph_multiyear function used matplotlib instead of plotly. Not because matplotlib is better than plotly, but because other parts of censusdis already use matplotlib. And there’s value in a package having consistency in terms of which visualization package it uses. This made sense to me, although I do miss the interactive plots that plotly provided!

Conclusion

The book Software Engineering at Google defines software engineering as “programming integrated over time.” The idea is that when code is written for a small project, software engineering best practices aren’t that important. But when code is used over a long period of time, they become essential. This idea stayed with me throughout this project.

    • The first time I did a multi-year analysis of ACS data was for my Covid Demographics Explorer, which I completed last June. I considered the project a one-off. I wrote a single script to download the data and an app to visualize it.
    • For my hometown_analysis project I wanted to do exploratory data analysis of several variables over time. So I wrote a handful of functions to download and visualize multi-year ACS data. I put all the code in a single module and pinned the dependencies. I wrote docstrings for all the functions. I reasoned that if I ever want to do a similar analysis in the future then I could reuse the code.
    • When I wanted to make it easier for others to use the code, I added it to an existing package. That required being more rigorous about coding style and adding automated tests and type annotations. It also required me to make design decisions that are best for the overall package, even when they conflict with design decisions I made when working on the module independently.

My impression is that a lot of Python programmers (especially data scientists) have never contributed their code to an existing package. If you are given the opportunity, then I recommend giving it a shot. I found that it helped me grow as a Python programmer.

While I have disabled comments on my blog, I welcome hearing from readers. Use this form to contact me.

