This is a quick example regarding the ggtext package. It’s one of the many packages that extend ggplot2, with this one focused on adding and formatting text in graphs. The particularly interesting thing for me is that it allows Markdown and other formatting in a graph’s labels.
I’ve seen a number of examples of MCMC algorithms, and while they’re all solid, a lot of them tend to be a bit too neat: they have a fairly simple model, a single predictor (maybe two), and not much else. This one is a good example: it covers the theory in detail, but it uses an obviously toy data set. So I decided to throw together a slightly more intricate example, highlighting a couple of issues and tricks worth noting for a handwritten implementation.
Note that this post is written under the assumption that the reader already has some knowledge about what MCMC is generally for and broadly how it works. This post is all R code (see here), with no JAGS or BUGS or such. The
ISLR libraries are required – the former two for the plots, the latter for the data set used.
Inspired by this post, which tries to calculate streaks in Python’s pandas library, I thought I’d give it a try in R, since it’s all just dataframe operations in the Python post. I won’t repeat his analysis, but I will replicate the streak determination and some of the plots. The data he uses is here.
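As a rough illustration of what "streak determination" means, here is a minimal Python sketch using run-length encoding. It is not the code from either post (the post itself works in R on dataframes); the `streaks` and `longest_streak` helpers and the `results` sequence are hypothetical, made up purely for illustration.

```python
from itertools import groupby


def streaks(outcomes):
    """Return (value, run_length) for each run of consecutive equal outcomes."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(outcomes)]


def longest_streak(outcomes, target):
    """Length of the longest run of `target`, or 0 if it never occurs."""
    lengths = [n for value, n in streaks(outcomes) if value == target]
    return max(lengths, default=0)


# Hypothetical win/loss sequence, not the data used in the post.
results = ["W", "W", "L", "W", "W", "W", "L", "L"]

print(streaks(results))
print(longest_streak(results, "W"))
```

The same idea carries over to a dataframe: mark where the value changes from the previous row, take a cumulative sum of those marks to get a streak ID, and count the rows within each ID.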
This problem was the Riddler Classic on 538 for May 15, 2020. The problem is as follows:
The fifth edition of Dungeons & Dragons introduced a system of “advantage and disadvantage.” When you roll a die “with advantage,” you roll the die twice and keep the higher result. Rolling “with disadvantage” is similar, except you keep the lower result instead. The rules further specify that when a player rolls with both advantage and disadvantage, they cancel out, and the player rolls a single die. Yawn!
There are two other, more mathematically interesting ways that advantage and disadvantage could be combined. First, you could have “advantage of disadvantage,” meaning you roll twice with disadvantage and then keep the higher result. Or, you could have “disadvantage of advantage,” meaning you roll twice with advantage and then keep the lower result. With a fair 20-sided die, which situation produces the highest expected roll: advantage of disadvantage, disadvantage of advantage or rolling a single die?
Extra Credit: Instead of maximizing your expected roll, suppose you need to roll N or better with your 20-sided die. For each value of N, is it better to use advantage of disadvantage, disadvantage of advantage or rolling a single die?
This problem seemed like it could be tackled from both a coding/simulation angle and an analytical angle. So I did both. The solutions can be found here; while the path I take is a bit different, the results are the same.
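For the analytical angle, one possible route (not necessarily the one taken in the post) goes through cumulative distribution functions: if F(k) is the chance a roll is at most k, the maximum of two independent rolls has CDF F(k)^2 and the minimum has CDF 1 - (1 - F(k))^2. The Python sketch below composes those two rules and uses exact fractions; all function names are my own, chosen for illustration.

```python
from fractions import Fraction

SIDES = 20


def single(k):
    """P(one fair die roll <= k)."""
    return Fraction(k, SIDES)


def advantage(cdf):
    """CDF of the higher of two independent rolls drawn from `cdf`."""
    return lambda k: cdf(k) ** 2


def disadvantage(cdf):
    """CDF of the lower of two independent rolls drawn from `cdf`."""
    return lambda k: 1 - (1 - cdf(k)) ** 2


def expected(cdf):
    """E[X] = sum over k = 0..SIDES-1 of P(X > k), for X taking values 1..SIDES."""
    return sum(1 - cdf(k) for k in range(SIDES))


def at_least(cdf, n):
    """P(X >= n), the quantity asked about in the extra credit."""
    return 1 - cdf(n - 1)


schemes = {
    "single die": single,
    "advantage of disadvantage": advantage(disadvantage(single)),
    "disadvantage of advantage": disadvantage(advantage(single)),
}

for name, cdf in schemes.items():
    print(f"{name}: expected roll = {float(expected(cdf)):.4f}")
```

Calling `at_least(cdf, n)` for each scheme and each `n` from 1 to 20 gives the comparison needed for the extra credit.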
Among all probability distributions, the normal distribution is probably the best established and best characterized. The central limit theorem and the normality assumptions in linear regression show how important it is.
One of its more interesting properties is that you can use it to approximate a binomial distribution. Using a continuous distribution to approximate a discrete one feels a little weird, and there are certain assumptions needed for it to work, but it raises an interesting question – how normal can other distributions look?
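To make the binomial case concrete, here is a small Python sketch comparing an exact binomial CDF with its normal approximation. It assumes the usual setup: mean np, standard deviation sqrt(np(1 - p)), and a continuity correction of 0.5. The values n = 100, p = 0.5 and k = 55 are arbitrary choices for illustration, not taken from the post.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), with a 0.5 continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


# Arbitrary illustrative values.
n, p, k = 100, 0.5, 55
exact = binom_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Moving p toward 0 or 1, or shrinking n, is a quick way to watch the approximation degrade, which is where the assumptions mentioned above start to matter.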
Bayesian statistics is centered on constructing certain assumptions about how the probability of an event is distributed, and then adjusting that belief as new information comes in. It can be more involved to construct a Bayesian model as opposed to the “look at many things in aggregate” approach used in frequentist statistics. But it has nice properties, and we’ll take a look at them in a real albeit fairly unimportant context: the Pokemon video games.
I had occasion a while back to try to do a random forest prediction in C. This is a highly situational need – I only did it because I needed to get a random forest that could work with other stuff written in C, no Python allowed – but it was interesting to try to pull apart scikit-learn’s RandomForestRegressor and restructure it in another way.
Handling numbers as strings is one of those data things that’s a pretty consistent pain. I, personally, have had to deal with translating between binary and hexadecimal strings with some regularity. And this is a situational need, so there’s not much reason to expect something pre-built. So I threw together a quick class myself.
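For a sense of what such a translation involves, here is a minimal Python sketch. It is not the class from the post: `bin_to_hex` and `hex_to_bin` are hypothetical helpers, and the choice to preserve leading zeros by padding to whole hex digits is an assumption on my part.

```python
def bin_to_hex(bits):
    """Convert a binary string to lowercase hex, keeping leading zeros.

    The output is padded to one hex digit per 4 input bits (rounded up).
    """
    width = (len(bits) + 3) // 4
    return format(int(bits, 2), f"0{width}x")


def hex_to_bin(digits):
    """Convert a hex string to binary, emitting exactly 4 bits per hex digit."""
    width = 4 * len(digits)
    return format(int(digits, 16), f"0{width}b")


print(bin_to_hex("00101111"))
print(hex_to_bin("2f"))
```

Going through `int` keeps the sketch short; a class that wraps the string and its base would be a natural next step if the conversions need to be chained.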
First broadcast in 1988, Mystery Science Theater 3000 is a television show whose nominal story involves a guy being trapped in space by a couple of mad scientist types…which is actually just an excuse to have a few guys make fun of really, really bad movies. This raises a few unusual questions about the series (as far as TV series go, anyway), like how the movie quality relates to the episode quality. Thankfully, this isn’t too hard to get data on, as we can just look at the IMDB ratings for both.