Code

A few datasets that I’ve seen came with several columns representing binary responses to questions. Naturally, there were missing values scattered throughout, so some amount of imputation had to occur. I decided to try coding up a way to do this by picking the mode of the rows most similar to the row with missing values.

The data being considered is a set of binary responses to something like yes-no survey questions, not one-hot encoded columns. This method is also very slow, so it’s not recommended for much – it just seemed like an interesting experiment.
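As a rough illustration of the idea, here is a minimal Python sketch. It assumes that rows are lists of 0/1 values with `None` marking a missing entry, that similarity means Hamming distance over the observed columns, and that ties go to 1. The function name `impute_binary` and all of those choices are illustrative assumptions, not necessarily what the post itself does.

```python
from collections import Counter


def impute_binary(rows):
    """Fill None entries with the mode of the most similar complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to borrow values from")

    imputed = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            imputed.append(list(row))
            continue

        observed = [j for j, v in enumerate(row) if v is not None]
        # Hamming distance to every complete row, using observed columns only
        dists = [sum(row[j] != c[j] for j in observed) for c in complete]
        best = min(dists)
        neighbours = [c for c, d in zip(complete, dists) if d == best]

        filled = list(row)
        for j in missing:
            counts = Counter(c[j] for c in neighbours)
            # Mode of the nearest rows; ties resolve to 1 (an arbitrary choice)
            filled[j] = 1 if counts[1] >= counts[0] else 0
        imputed.append(filled)
    return imputed


survey = [[1, 1, 1], [1, 1, 1], [0, 0, 0], [1, 1, None], [0, None, 0]]
completed = impute_binary(survey)
```

Every incomplete row is compared against every complete row, which is one reason an approach like this gets slow on larger datasets.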

This is a quick post on animating how the transition matrix of a Markov chain changes over larger time steps, as well as showing the probability of the chain being in any specified state over time. This post uses the tidyverse, along with gganimate.

> library(tidyverse)
> library(magrittr)  ## using some aliases not loaded by default
> library(gganimate)
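The quantities being animated can be sketched without any plotting. For a chain with one-step transition matrix P, the n-step transition matrix is the matrix power P^n, and the state probabilities after n steps are the starting distribution multiplied by P^n. The snippet below is a plain-Python sketch of that calculation rather than the post's R code; the two-state matrix and the function names are made up for illustration.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def n_step_matrices(p, steps):
    """Return [P^1, P^2, ..., P^steps] for a one-step transition matrix P."""
    out = [p]
    for _ in range(steps - 1):
        out.append(mat_mul(out[-1], p))
    return out


def state_probabilities(start, matrices):
    """Distribution over states after each step, given a starting distribution."""
    return [mat_mul([start], m)[0] for m in matrices]


# Hypothetical two-state chain; each row sums to 1
P = [[0.9, 0.1], [0.5, 0.5]]
powers = n_step_matrices(P, 3)
probs = state_probabilities([1.0, 0.0], powers)
```

Each element of `powers` would correspond to one frame of the transition-matrix animation, and each element of `probs` to one time point of the state-probability plot.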

The magrittr package is a part of the extended tidyverse – i.e., not one of the ones normally loaded. It is the one that supplies the pipe operator (%>%), but it turns out that the package actually contains four pipe operators in total. All are intended to streamline and improve the readability of code, though the three non-basic ones are a bit more situational, and I’ve rarely seen them used, so I thought I would go into them a bit.

The CRAN page for magrittr is here; much of this post is based on the package’s vignettes and documentation.

A common thing to check in data is whether each value in one column maps to exactly one value in another column. This post is a quick bit of Python code to try to visualize that situation.
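Before visualizing anything, the check itself can be sketched in a few lines. The helper below, `non_unique_matches`, is a hypothetical name, and the example pairs are made up. It collects the distinct second-column values seen for each first-column value and reports any first-column value that maps to more than one.

```python
from collections import defaultdict


def non_unique_matches(pairs):
    """Map each first-column value to its second-column values when there are several."""
    seen = defaultdict(set)
    for a, b in pairs:
        seen[a].add(b)
    # Keep only the keys that are paired with more than one distinct value
    return {a: sorted(bs) for a, bs in seen.items() if len(bs) > 1}


# Made-up (zip_code, city) pairs; "10001" appears with two different cities
records = [("10001", "New York"), ("10001", "NYC"), ("60601", "Chicago"), ("60601", "Chicago")]
conflicts = non_unique_matches(records)
```

An empty result means the first column determines the second; swapping the order of each pair checks the other direction.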

As with most database technologies, MongoDB has support for a Date-type object. Writing operations on date fields in MongoDB can be a little tricky, mostly because, while the date operators are fairly straightforward, they won’t work in normal find() queries, so you need to use the aggregation syntax for anything complicated.

Missing values are inevitable in data science, and handling them is a constant issue. With Boolean logic, missing values can behave quite differently depending on the order of the arguments and exactly how the expression is set up, unlike with many other data types. Whether this is useful or not depends on the scenario, but the behavior is something to keep in mind.
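One way to see order dependence is with Python's built-in `and`, using `None` as a stand-in for a missing value. That is an assumption for illustration, since the language and NA type the post covers are not stated here. Because `and` short-circuits and returns one of its operands, swapping the arguments changes the result, whereas a three-valued (Kleene) AND like the hypothetical `kleene_and` below gives the same answer either way.

```python
def kleene_and(a, b):
    """Three-valued AND where None stands in for a missing value."""
    if a is False or b is False:
        return False  # a known False decides the result regardless of the other side
    if a is None or b is None:
        return None  # otherwise a missing operand leaves the result unknown
    return True


# Built-in `and` returns the first falsy operand, so argument order matters
order_dependent = (None and False, False and None)

# The three-valued version is symmetric in its arguments
order_free = (kleene_and(None, False), kleene_and(False, None))
```

Here `order_dependent` is `(None, False)` while `order_free` is `(False, False)`.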

Partial regression plots – also called added variable plots, among other things – are a type of diagnostic plot for multiple linear regression models. More specifically, they attempt to show the effect of adding a new variable to an existing model by controlling for the effect of the predictors already in use. They’re useful for spotting points with high influence or leverage, as well as seeing the partial correlation between the response and the new predictor.

This post is meant as a short tutorial on how to set up PySpark to access a MySQL database and run a quick machine learning algorithm with it. Both PySpark and MySQL are installed locally on a computer running Kubuntu 20.04 in this example, so this can be done without any external resources.

The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. However, this can lead to scenarios where the individual t-tests suggest that there is no difference, even though looking at all variables jointly would show one. When a multivariate test is preferred, the obvious choice is Hotelling’s \(T^2\) test.

Hotelling’s test has the same overall flexibility that the t-test does, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.
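For reference, the usual textbook form of the two-sample statistic is written out below. It assumes both samples are multivariate normal with a common covariance matrix. The symbols are introduced here for illustration: \(n_1, n_2\) are the sample sizes, \(p\) is the number of variables, \(\bar{x}_1, \bar{x}_2\) are the sample mean vectors, and \(S_1, S_2\) are the sample covariance matrices.

\[
S_p = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2},
\qquad
T^2 = \frac{n_1 n_2}{n_1 + n_2} \, (\bar{x}_1 - \bar{x}_2)^\top S_p^{-1} (\bar{x}_1 - \bar{x}_2)
\]

\[
\frac{n_1 + n_2 - p - 1}{p \, (n_1 + n_2 - 2)} \, T^2 \sim F_{p,\; n_1 + n_2 - p - 1}
\]

Under the null hypothesis of equal mean vectors, the scaled statistic follows an F distribution, which is where the p-value comes from.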

If you want a large amount of text data, it’s hard to beat the dump of the English Wikipedia. Even when compressed, the text-only dumps take up close to 20 gigabytes, and they expand by a factor of 5 to 10 when uncompressed. Handling all of this data effectively can still be done on a personal machine, though, thanks to two factors: you can access the data without decompressing it, due to the properties of BZ2 files, and it’s stored as XML data.

I’m going to focus purely on accessing the contents of the pages contained in the September 1, 2020 dump, not any of the multitude of supporting files that come with each dump. In particular, that excludes the complete edit histories for each page, which are nearly a terabyte even while compressed. More complete information is on Wikipedia itself, with this page being a good starting point.
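A minimal sketch of reading pages straight out of the compressed file, using only Python's standard library, might look like the following. It assumes the dump nests `title` and `text` elements inside each `page` element, and it streams the whole file sequentially rather than jumping to specific pages; the helper name `iter_pages` is illustrative.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(path):
    """Yield (title, text) pairs from a bz2-compressed XML dump, one page at a time."""
    with bz2.open(path, "rb") as handle:
        # iterparse reads incrementally, so the file is never fully decompressed to disk
        for _, elem in ET.iterparse(handle, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix, if any
            if tag != "page":
                continue
            title = text = None
            for child in elem.iter():
                name = child.tag.rsplit("}", 1)[-1]
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text
            yield title, text
            elem.clear()  # release this page's children before moving on
```

Because this is a generator, something like `for title, text in iter_pages(path): ...` holds only one page's content in memory at a time.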

Binary Missing Value Imputation

Markov Transition (Animated) Plots

The Four Pipes of magrittr

Unique Values Between Columns

Date Operations in MongoDB

Booleans & NAs

Partial Regression Plots in Julia, Python, and R

PySpark + MySQL Tutorial

Hotelling's T^2 in Julia, Python, and R

How To Read A Wikipedia Dump