R

LDA vs QDA vs Logistic Regression

There are plenty of methods to choose from for classification problems, all with their own strengths and weaknesses. This post will try to compare three of the more basic ones: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression.

Hotelling's T^2 in Julia, Python, and R

The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. This could lead to scenarios where individual t-tests suggest that there is no difference, although looking at all variables jointly will show a difference. When a multivariate test is preferred, the obvious choice is the Hotelling’s \(T^2\) test.

Hotelling’s test has the same overall flexibility that the t-test does, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.

Matrix to LaTeX

I recently had to go through some matrix operations in R and then write up the results in LaTeX. Formatting the R output to get it into a form for LaTeX isn’t particularly hard, but it’s tedious and it has a regular structure, so it seemed like it would be easy to code it up. So I decided to try it for R, Python, and Julia.

An Example With accumulate()

As with most useful (collections of) libraries, the tidyverse has a lot to offer. One interesting bit that I found recently was the accumulate() function in the purrr library, which allows you to apply a function over a succession of values in a vector. This post is a quick example of its use, using linear regression models.

Spotify Cross-Playlist Predictions, Part 2

This is a follow up to the previous post, where the mechanics of making cross-playlist predictions were covered. This post covers the second half of the project: now that we have the analysis method and the important functions worked out in practice, we need to code this functionality into a Shiny app, create a Docker container that holds and runs the app, and deploy the container on Amazon Web Services for public access.

As before, the code is available on Github. It won’t be completely replicated here due to its length.

Spotify Cross-Playlist Predictions, Part 1

This is the first of probably two posts detailing the construction of an RShiny app. The app in question is meant to take data from two Spotify playlists, make recommendations for tracks from one – which I’ll call the “target” playlist – based on the contents of another – the “reference” playlist. I don’t expect this to be comparable in ability to Spotify’s own system (or anything else, really), but it seems like it should be interesting.

My code is here.

Formatting With ggtext Example

This is a quick example regarding the ggtext package. It’s one of the many packages that extends ggplot2, with this one having a focus on adding and formatting text in graphs. The particularly interesting thing for me is that it allows Markdown and other formatting of the labels in a graph.

A Slightly More Advanced MCMC Example

I’ve seen a number of examples of MCMC algorithms, and while they’re all solid, a lot of them tend to be a bit too neat - they have a fairly simple model, a single predictor (maybe two), and not much else. This one is a good example, as it covers the theory in detail, but it’s using an obviously toy data set. So I decided to throw together a slightly more intricate example, highlighting a couple of issues and tricks worth noting for a handwritten implementation.

Note that this post is written under the assumption that the reader already has some knowledge about what MCMC is generally for and broadly how it works. This post is all R code (see here), with no JAGS or BUGS or such. The tidyverse, patchwork, and ISLR libraries are required – the former two for the plots, the latter for the data set used.

Detecting Streaks in R

Inspired by this post, which tries to calculate streaks in Python’s pandas library, I thought I’d give it a try in R, since it’s all just dataframe operations in the Python post. I won’t repeat his analysis, but I will replicate the streak determination and some of the plots. The data he uses is here.

538 Dungeons & Dragons Riddler

This problem was the Riddler Classic on 538 for May 15, 2020. The problem is as follows:

The fifth edition of Dungeons & Dragons introduced a system of “advantage and disadvantage.” When you roll a die “with advantage,” you roll the die twice and keep the higher result. Rolling “with disadvantage” is similar, except you keep the lower result instead. The rules further specify that when a player rolls with both advantage and disadvantage, they cancel out, and the player rolls a single die. Yawn!
There are two other, more mathematically interesting ways that advantage and disadvantage could be combined. First, you could have “advantage of disadvantage,” meaning you roll twice with disadvantage and then keep the higher result. Or, you could have “disadvantage of advantage,” meaning you roll twice with advantage and then keep the lower result. With a fair 20-sided die, which situation produces the highest expected roll: advantage of disadvantage, disadvantage of advantage or rolling a single die?
Extra Credit: Instead of maximizing your expected roll, suppose you need to roll N or better with your 20-sided die. For each value of N, is it better to use advantage of disadvantage, disadvantage of advantage or rolling a single die?

This problem seemed like it could be tackled from both a coding/simulation angle and an analytical angle. So I did both. The solutions can be found here; while the path I take is a bit different, the results are the same.