Python

Wikipedia & Benford's Law

Benford’s law is the tendency for small digits to be more common than large ones when looking at the first non-zero digits in a large, heterogenous collection of numbers. These frequencies range from about 30% for a leading 1 down to about 4.6% for a leading 9, as opposed to the constant 11.1% you would get if they all appeared at the same rate.

Since I recently wrote about unpacking the pages from a dump of the English Wikipedia, I thought would see if Benford’s law manifested in the text of Wikipedia, as it seems like it fits the idea of a “large, heterogenous collection of numbers” quite well.

The notebook containing the full code is here.

Hotelling's T^2 in Julia, Python, and R

The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. This could lead to scenarios where individual t-tests suggest that there is no difference, although looking at all variables jointly will show a difference. When a multivariate test is preferred, the obvious choice is the Hotelling’s \(T^2\) test.

Hotelling’s test has the same overall flexibility that the t-test does, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.

How To Read A Wikipedia Dump

If you want a large amount of text data, it’s hard to beat the dump of the English Wikipedia. Even when compressed, the text-only dumps will take up close to 20 gigabytes, and it’ll expand by a factor of 5 to 10 when uncompressed. Effectively handling all of this data can be done on a personal machine, though, due to a combination of two factors – the fact that you can access the data without decompressing it, thanks to the properties of BZ2 files, and the fact that it’s stored as XML data.

I’m going to focus purely on accessing the contents of the pages contained in the September 1, 2020 dump, not any of the multitude of supporting files that come with each dump, including – and especially – the complete page edit histories for each page, which are nearly a terabyte even while compressed. More complete information is on Wikipedia itself, with this page being a good starting point.

Stock Correlation Versus LSTM Prediction Error

When trying to look at examples of LSTMs in Keras, I’ve found a lot that focus on using them to predict stock prices in the future. Most are pretty bare-bones though, consisting of little more than a basic LSTM network and a quick plot of the prediction. Though I think the utility of these models is a little questionable, it brought a question into my head: how accurate are the predictions made by a model trained on one stock if it’s predicting on another stock?

The full code can be found here.

Market Prediction with ETFs & Convolutional Networks

Convolutional networks are most prominently used for image analysis or on data with multiple spatial dimensions. Of course, since the inputs to the CNNs are all just numbers, you can feed in other data that has some a relationship encoded into the dimensions of the array. This post involves feeding data for historical returns from exchange traded funds (ETFs) into a CNN, and using it to try to predict the direction of the Dow Jones Industrial Average (DJIA) some time in the future. I’ll be using Keras to code the neural network. The Jupyter notebook used to develop this code is here.

As with all posts of this nature, this shouldn’t be taken as advice on what to do with your money.

Matrix to LaTeX

I recently had to go through some matrix operations in R and then write up the results in LaTeX. Formatting the R output to get it into a form for LaTeX isn’t particularly hard, but it’s tedious and it has a regular structure, so it seemed like it would be easy to code it up. So I decided to try it for R, Python, and Julia.

Python Random Forest in C

I had occasion a while back to try to do a random forest prediction in C. This is a highly situational need – I only did it because I needed to get a random forest that could work with other stuff written in C, no Python allowed – but it was interesting to try to pull apart scikit-learn’s RandomForestRegressor and restructure it in another way.

Integer String Conversion in Python

Handling numbers as strings is one of those data things that’s a pretty consistent pain. I, personally, have had to deal with translating between binary and hexadecimal strings with some regularity. And this is a situational need, so there’s not much reason to expect something pre-built. So I threw together a quick class myself.

MST3K Episode vs Movie Scores

First broadcast in 1988, Mystery Science Theater 3000 is a television show whose nominal story involves a guy being trapped in space by a couple of mad scientist types…which is actually just an excuse to have a few guys make fun of really, really bad movies. This raises a few unusual questions about the series (as far as TV series go, anyway), like how the movie quality relates to the episode quality. Thankfully, this isn’t too hard to get data on, as we can just look at the IMDB ratings for both.