Word Origin Relationship Graph

This post covers an analysis of word origins from a dataset found here. I found the dataset a while back, and it seemed like a good opportunity to experiment with directed graphs, since word origins can reasonably be represented that way.

Binary Missing Value Imputation

A few datasets that I’ve seen have come with several different columns representing binary responses to questions. Naturally, there are missing values scattered throughout, so some amount of imputation had to occur. I decided to try coding up a way to do this by picking the mode of rows that were as similar as possible to the row with missing values.

The data being considered is a set of binary responses to something like yes-no survey questions, not something like one-hot encoded columns. This method is also very slow in general, so it’s not recommended for much; it just seemed like an interesting experiment.
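As a rough illustration of the idea (not the post's actual code), here is a minimal Python sketch. It assumes missing values are stored as `None`, measures similarity as the number of matching answers in columns both rows have observed, and breaks ties toward 0. The function name `impute_binary` and the sample data are made up for this example.

```python
from collections import Counter

def impute_binary(rows):
    """Fill None entries in rows of 0/1 answers using the mode of the most similar rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Score every other row that has an observed answer in column j.
            scored = []
            for k, other in enumerate(rows):
                if k == i or other[j] is None:
                    continue
                # Count agreements only over columns observed in both rows.
                matches = sum(
                    1 for a, b in zip(row, other)
                    if a is not None and b is not None and a == b
                )
                scored.append((matches, other[j]))
            if not scored:
                continue  # no donor rows, so the value stays missing
            best = max(m for m, _ in scored)
            donors = Counter(v for m, v in scored if m == best)
            # Assumption: a tie between 0 and 1 is resolved as 0.
            filled[i][j] = 1 if donors[1] > donors[0] else 0
    return filled

survey = [
    [1, 0, 1, 1],
    [1, 0, 1, None],
    [0, 1, 0, 0],
    [1, 0, None, 1],
]
completed = impute_binary(survey)
```

Every missing cell triggers a comparison against every other row, which is a large part of why an approach like this gets slow on bigger datasets.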

Markov Transition (Animated) Plots

This is a quick post intended for animating how the transition matrix of a Markov chain changes between larger time steps, as well as showing the probability of the chain being in any specified state over time. This post uses the tidyverse, along with gganimate.

> library(tidyverse)
> library(magrittr)  ## using some aliases not loaded by default
> library(gganimate)

Examination of the K-Means Broken-Line Method

I recently encountered a 2018 paper called “The next-generation \(k\)-means algorithm”. It proposes and compiles advancements and theoretical justifications for \(k\)-means and \(k\)-medians clustering. One part that caught my eye was the proposed “broken-line algorithm” for finding the optimal number of clusters in \(k\)-means. Though the paper explains how the authors tested the idea, it contains no code, and I couldn’t find any repositories associated with it. This post covers an attempt to implement the method, along with a slightly more intensive battery of tests, as the paper used only one test method.

The Four Pipes of magrittr

The magrittr package is a part of the extended tidyverse – i.e., not one of the ones normally loaded. It is the one that supplies the pipe operator (%>%), but it turns out that the package actually contains four pipe operators in total. All are intended to streamline and improve the readability of code, though the three non-basic ones are a bit more situational, and I’ve rarely seen them used, so I thought I would go into them a bit.

The CRAN page for magrittr is here; much of this post is based on the package’s vignettes and documentation.

Depression Preprint Analysis, Part 2

This is the second post in a series looking at a collection of preprint papers on a specific topic, in this case depression. In the previous post, I scraped the website of the Open Science Framework (OSF) for a list of preprints on the topic. As it turned out, the majority of preprints that dealt with the psychological condition were from PsyArXiv, so I’m focusing this post on topic modeling using only preprints from there.

Depression Preprint Analysis, Part 1

This is the first post in a series focused on analyzing the contents of a collection of preprint papers on a topic, in this case depression. This post covers how I scraped the (initial) website, along with some analysis of basic information from the descriptions of the preprints.

Unique Values Between Columns

A common thing to check in data is whether each value in one column maps to exactly one value in another column. This post is a quick bit of Python code to try to visualize that situation.
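The underlying check (not the visualization, and not the post's own code) can be sketched in a few lines of plain Python. The helper name `non_unique_mappings` and the sample columns are made up for this example; it simply collects, for each value in the first column, the set of values it is paired with in the second.

```python
from collections import defaultdict

def non_unique_mappings(left, right):
    """Return the left-column values that pair with more than one right-column value."""
    seen = defaultdict(set)
    for a, b in zip(left, right):
        seen[a].add(b)
    return {a: sorted(bs) for a, bs in seen.items() if len(bs) > 1}

# Hypothetical columns: "DE" is paired with two different capitals.
country = ["FR", "FR", "DE", "DE", "US"]
capital = ["Paris", "Paris", "Berlin", "Bonn", "Washington"]

conflicts = non_unique_mappings(country, capital)
print(conflicts)  # {'DE': ['Berlin', 'Bonn']}
```

An empty result means the first column maps uniquely onto the second; swapping the arguments checks the other direction.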

Date Operations in MongoDB

As with most database technologies, MongoDB has support for a Date-type object. Writing operations on date fields in MongoDB can be a little tricky: while the date operators themselves are fairly straightforward, they can’t be used directly as query operators in a normal find() query, so you need the aggregation syntax for anything complicated.
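To give a sense of what that aggregation syntax looks like, here is a small sketch written as plain Python data, in the form a driver such as PyMongo would accept for `collection.aggregate(...)`. The field name `created_at` is hypothetical; `$year`, `$month`, `$group`, `$match`, and `$sort` are standard MongoDB aggregation operators and stages.

```python
# Count documents per calendar month of a (hypothetical) `created_at` date field.
pipeline = [
    # Keep only documents where the field is actually stored as a BSON date.
    {"$match": {"created_at": {"$type": "date"}}},
    # Date operators such as $year and $month are aggregation expressions.
    {
        "$group": {
            "_id": {
                "year": {"$year": "$created_at"},
                "month": {"$month": "$created_at"},
            },
            "count": {"$sum": 1},
        }
    },
    # Order the monthly buckets chronologically.
    {"$sort": {"_id.year": 1, "_id.month": 1}},
]

stage_names = [next(iter(stage)) for stage in pipeline]
```

With a live connection, this list would be passed unchanged to the collection's `aggregate` method; nothing here touches a database on its own.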

Booleans & NAs

Missing values are inevitable in data science, and handling them is a constant issue. In Boolean logic, unlike with many other data types, a missing value can behave quite differently depending on the order of arguments and exactly how the expression is set up. Whether this is useful or not depends on the scenario, but the behavior is something to keep in mind.