This post is meant as a short tutorial on how to set up PySpark to access a MySQL database and run a quick machine learning algorithm with it. Both PySpark and MySQL are installed locally on a computer running Kubuntu 20.04 in this example, so all of this can be done without any external resources.
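As a preview, the core of the connection looks something like this – the jar path, database name, table, and credentials below are all placeholders, not the values used later in the post:

```python
from pyspark.sql import SparkSession

# Point Spark at the MySQL Connector/J jar so the JDBC driver is available.
spark = (
    SparkSession.builder
    .appName("mysql-demo")
    .config("spark.jars", "/path/to/mysql-connector-java.jar")
    .getOrCreate()
)

# Read a table straight into a Spark dataframe over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/some_db")
    .option("dbtable", "some_table")
    .option("user", "some_user")
    .option("password", "some_password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)
df.show(5)
```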
The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. However, this can lead to scenarios where the individual t-tests suggest no difference even though the variables, taken jointly, do show one. When a multivariate test is preferred, the obvious choice is Hotelling’s \(T^2\) test.
Hotelling’s test has the same overall flexibility as the t-test, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.
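For reference, the two-sample statistic compares the mean vectors using the pooled covariance matrix:

\[
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^\top \mathbf{S}_{\text{pooled}}^{-1}\,(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2),
\qquad
\mathbf{S}_{\text{pooled}} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2},
\]

where \(p\) is the number of variables. Under the null hypothesis, \(\frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\,T^2\) follows an \(F_{p,\,n_1 + n_2 - p - 1}\) distribution.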
If you want a large amount of text data, it’s hard to beat the dump of the English Wikipedia. Even when compressed, the text-only dumps take up close to 20 gigabytes, and they expand by a factor of 5 to 10 when uncompressed. All of this data can still be handled effectively on a personal machine, though, thanks to two things: the properties of BZ2 files mean you can access the data without decompressing it first, and the contents are stored as XML.
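To make that concrete, here’s a minimal sketch of the streaming idea in Python – read the compressed dump incrementally and pull out page titles without ever expanding the file on disk. The filename is a placeholder for whichever dump file you have:

```python
import bz2
import xml.etree.ElementTree as ET

# Stream the compressed XML; bz2.open decompresses on the fly, and
# iterparse yields elements as it finishes reading them.
with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as f:
    for _, elem in ET.iterparse(f):
        # Tags carry the MediaWiki namespace, so match on the suffix.
        if elem.tag.endswith("title"):
            print(elem.text)
        elem.clear()  # discard processed content to keep memory flat
```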
I’m going to focus purely on accessing the contents of the pages contained in the September 1, 2020 dump, not any of the multitude of supporting files that come with each dump, including – and especially – the complete page edit histories for each page, which are nearly a terabyte even while compressed. More complete information is on Wikipedia itself, with this page being a good starting point.
Convolutional networks are most prominently used for image analysis or on data with multiple spatial dimensions. Of course, since the inputs to a CNN are all just numbers, you can feed in any other data that has some relationship encoded into the dimensions of the array. This post involves feeding historical returns from exchange traded funds (ETFs) into a CNN, and using it to try to predict the direction of the Dow Jones Industrial Average (DJIA) some time in the future. I’ll be using Keras to code the neural network. The Jupyter notebook used to develop this code is here.
As with all posts of this nature, this shouldn’t be taken as advice on what to do with your money.
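For a flavor of what’s to come, here’s a toy model along those lines – the window length, number of ETFs, and layer sizes are illustrative only, not the ones used in the post:

```python
from tensorflow import keras

# 1-D convolutions over a window of 30 days of returns for 10 ETFs,
# predicting the probability that the index goes up.
model = keras.Sequential([
    keras.layers.Input(shape=(30, 10)),
    keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    keras.layers.MaxPooling1D(2),
    keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),  # P(DJIA up)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```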
I recently had to go through some matrix operations in R and then write up the results in LaTeX. Formatting the R output to get it into a form LaTeX accepts isn’t particularly hard, but it’s tedious, and the output has a regular structure, so it seemed like it would be easy to code up. So I decided to try it in R, Python, and Julia.
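The Python version, for instance, boils down to a small helper along these lines – the function name and the rounding behavior are my own choices here:

```python
import numpy as np

def to_latex_matrix(m, digits=3):
    """Render a matrix as a LaTeX bmatrix environment."""
    rows = [" & ".join(f"{x:.{digits}g}" for x in row)
            for row in np.atleast_2d(m)]
    # " \\" ends each LaTeX row; "&" separates the columns.
    return "\\begin{bmatrix}\n" + " \\\\\n".join(rows) + "\n\\end{bmatrix}"

print(to_latex_matrix(np.array([[1.0, 2.5], [3.25, 4.0]])))
```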
As with most useful (collections of) libraries, the tidyverse has a lot to offer. One interesting bit that I found recently was the accumulate() function in the purrr library, which lets you apply a function successively over the values of a vector, keeping each intermediate result. This post is a quick example of its use, using linear regression models.
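The post itself is all R, but Python’s itertools.accumulate works on the same principle – carry a two-argument function across a sequence and keep every intermediate value – which gives a quick feel for it:

```python
from itertools import accumulate

# Build up regression formulas one predictor at a time, in the spirit of
# the post's linear regression example. The column names are hypothetical.
predictors = ["x1", "x2", "x3"]
rhs = list(accumulate(predictors, lambda acc, p: acc + " + " + p))
print(["y ~ " + f for f in rhs])
# ['y ~ x1', 'y ~ x1 + x2', 'y ~ x1 + x2 + x3']
```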
This is a quick example regarding the ggtext package. It’s one of the many packages that extend ggplot2, with this one focusing on adding and formatting text in graphs. The particularly interesting thing for me is that it allows Markdown and other formatting of the labels in a graph.
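ggtext itself is R-only, but as a loose analogue of formatted plot labels, matplotlib’s built-in mathtext does something in the same spirit (TeX-style markup rather than Markdown):

```python
import matplotlib.pyplot as plt

# Formatted (rather than plain) labels via mathtext in the $...$ spans.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title(r"Growth of $y = x^2$", fontsize=14, fontweight="bold")
ax.set_xlabel(r"$x$ (input)")
ax.set_ylabel(r"$y$ (squared value)")
plt.show()
```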
I’ve seen a number of examples of MCMC algorithms, and while they’re all solid, a lot of them tend to be a bit too neat – a fairly simple model, a single predictor (maybe two), and not much else. This one is a good example, as it covers the theory in detail, but it uses an obviously toy data set. So I decided to throw together a slightly more intricate example, highlighting a couple of issues and tricks worth noting for a handwritten implementation.
Note that this post is written under the assumption that the reader already has some knowledge of what MCMC is generally for and broadly how it works. This post is all R code (see here), with no JAGS or BUGS or such. A couple of plotting libraries and the ISLR library are required – the former for the plots, the latter for the data set used.
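The post’s code is all R, but the core random-walk Metropolis update is compact enough to sketch generically – here in Python, against a standard normal target (the log-posterior is a stand-in for whatever model you’re fitting):

```python
import numpy as np

def log_post(theta):
    # Stand-in log-posterior: a standard normal target.
    return -0.5 * theta ** 2

rng = np.random.default_rng(42)
n_iter, step = 5000, 1.0
samples = np.empty(n_iter)
theta = 0.0
for i in range(n_iter):
    proposal = theta + rng.normal(scale=step)
    # Accept with probability min(1, posterior ratio), done on the log scale.
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples[i] = theta
```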
Inspired by this post, which calculates streaks in Python’s pandas library, I thought I’d give it a try in R, since it’s all just dataframe operations in the Python post. I won’t repeat his analysis, but I will replicate the streak determination and some of the plots. The data he uses is here.
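For comparison with the R version, the pandas side comes down to a shift/cumsum trick along these lines (the data here is made up):

```python
import pandas as pd

df = pd.DataFrame({"win": [1, 1, 0, 1, 1, 1, 0, 0, 1]})
# A new group starts whenever the value differs from the previous row.
group_id = (df["win"] != df["win"].shift()).cumsum()
# Position within each group (+1) is the running streak length.
df["streak"] = df.groupby(group_id).cumcount() + 1
print(df)
```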
I had occasion a while back to try to do a random forest prediction in C. This is a highly situational need – I only did it because I needed to get a random forest that could work with other stuff written in C, no Python allowed – but it was interesting to try to pull apart scikit-learn’s RandomForestRegressor and restructure it in another way.
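The starting point for that kind of port is that each fitted tree already exposes its structure as flat arrays, which map naturally onto C arrays or structs – roughly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a small forest on synthetic data just to have something to inspect.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Each tree stores children, split features, thresholds, and leaf values
# as parallel arrays - exactly what you would re-emit as C tables.
for est in rf.estimators_[:1]:
    t = est.tree_
    print(t.children_left, t.children_right,
          t.feature, t.threshold, t.value.ravel())
```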