This is the second post in a series looking at a collection of preprint papers on a specific topic – in this case, depression. In the previous post, I scraped the website of the Open Science Framework (OSF) for a list of preprints on the topic. As it turned out, the majority of preprints that dealt with the psychological condition were from PsyArXiv, so I’m focusing this post on topic modeling using only preprints from there.
This is the first post in a series on analyzing the contents of a collection of preprint papers on a topic – in this case, depression. This post covers how I scraped the (initial) website, along with some analysis of basic information from the descriptions of the preprints.
A common thing to check in data is whether the values in one column map uniquely to the values of another column. This post is a quick bit of Python code for visualizing that situation.
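As a rough illustration of the check itself (not the visualization from the post), here is a minimal sketch using only the standard library. The function name `mapping_violations` and the `zip`/`city` example rows are made up for this example.

```python
from collections import defaultdict

def mapping_violations(rows, key_col, value_col):
    """Return {key: set of values} for keys paired with more than one distinct value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key_col]].add(row[value_col])
    return {key: values for key, values in seen.items() if len(values) > 1}

# Hypothetical example data: each row is a dict of column name -> value.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicgo"},  # a typo gives this key a second value
]

violations = mapping_violations(rows, "zip", "city")
print(violations)
```

An empty result means every value in the key column is paired with exactly one value in the other column.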
As with most database technologies, MongoDB has support for a Date-type object. Writing up operations on date fields in MongoDB can be a little tricky, mostly because while the date operators are fairly straightforward, they won’t work in normal find() queries, meaning you need to use the aggregation syntax for anything complicated.
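To give a feel for what the aggregation syntax looks like, here is a small sketch that builds a pipeline counting documents per calendar month with the `$year` and `$month` operators. The field name `created_at` and the helper `monthly_count_pipeline` are assumptions for illustration; the block only builds the pipeline as plain Python data and does not connect to a database.

```python
def monthly_count_pipeline(date_field):
    """Build an aggregation pipeline that counts documents per calendar month."""
    ref = "$" + date_field  # "$field" refers to the field's value inside a pipeline
    return [
        {"$group": {
            "_id": {"year": {"$year": ref}, "month": {"$month": ref}},
            "count": {"$sum": 1},
        }},
        {"$sort": {"_id.year": 1, "_id.month": 1}},
    ]

# "created_at" is a hypothetical Date field name.
pipeline = monthly_count_pipeline("created_at")
print(pipeline)
```

With PyMongo, a list like this would be passed to a collection's `aggregate()` method.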
Missing values are inevitable in data science, and handling them is a constant issue. Unlike with a lot of other data types, a missing value in Boolean logic can behave fairly differently depending on the order of the arguments and exactly how the expression is set up. Whether this is useful or not depends on the scenario, but the behavior is something to keep in mind.
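One way this order dependence can show up is with plain Python's `and`/`or` and a NaN standing in for the missing value. This is a sketch of the general idea and may not be the exact setup the post examines.

```python
import math

nan = math.nan  # stand-in for a missing value; note that bool(nan) is True

# `and` / `or` short-circuit and return one of their operands,
# so the position of the missing value decides the result.
results = {
    "True or nan": True or nan,      # left side decides: True
    "nan or True": nan or True,      # nan is truthy, so nan itself is returned
    "False and nan": False and nan,  # left side decides: False
    "nan and False": nan and False,  # nan is truthy, so the right side is returned
    "True and nan": True and nan,    # right side is returned: nan
    "nan and True": nan and True,    # right side is returned: True
}

for expression, value in results.items():
    print(f"{expression:>14} -> {value}")
```

Swapping the operands of `or` changes the answer from `True` to `nan`, even though logical OR is normally symmetric.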
A while ago, I heard an episode of Freakonomics Radio which discussed games and strategies for playing them. It stuck to pretty simple games, so there were no excursions into game theory or the like, but the part about Hangman caught my interest. That part largely came down to the use of conditional probability – since you know something about the word you’re trying to guess, you might be able to come up with a better strategy than blind guessing or just guessing based on the frequency of letters in the English language.
Code for this was written in Python.
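The conditional-probability idea can be sketched roughly as follows: keep only the words still consistent with the revealed pattern, then guess the letter that appears in the largest share of them. The function `best_guess` and the tiny word list are made up for illustration and are not the post's code.

```python
from collections import Counter

def best_guess(pattern, guessed, words):
    """Pick the unguessed letter found in the most words that still fit the pattern.

    `pattern` uses '_' for unknown positions, e.g. '_a_'.
    Returns (letter, number of candidate words containing it, number of candidates).
    """
    def fits(word):
        if len(word) != len(pattern):
            return False
        for p, w in zip(pattern, word):
            if p == "_":
                if w in guessed:  # a guessed letter would already be revealed
                    return False
            elif p != w:
                return False
        return True

    candidates = [w for w in words if fits(w)]
    counts = Counter()
    for word in candidates:
        counts.update(set(word) - set(guessed))  # count each letter once per word
    letter, hits = counts.most_common(1)[0] if counts else (None, 0)
    return letter, hits, len(candidates)

# Hypothetical toy dictionary.
words = ["cat", "bat", "hat", "can", "cot", "dog"]
letter, hits, n = best_guess("_a_", {"a"}, words)
print(letter, hits / n)  # conditional probability that the guess is in the word
```

With a realistic dictionary, the same ratio gives the probability that a letter is in the word, conditional on what has been revealed so far.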
Partial regression plots – also called added variable plots, among other things – are a type of diagnostic plot for multiple linear regression models. More specifically, they attempt to show the effect of adding a new variable to an existing model by controlling for the effect of the predictors already in use. They’re useful for spotting points with high influence or leverage, as well as seeing the partial correlation between the response and the new predictor.
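The points in such a plot come from two sets of residuals: the response regressed on the existing predictors, and the new predictor regressed on those same predictors. Below is a minimal sketch with a single existing predictor and made-up, noise-free data. The helper names and the data are assumptions for illustration.

```python
def simple_fit(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    intercept, slope = simple_fit(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data: x1 is already in the model, x2 is the candidate predictor.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

ry = residuals(x1, y)    # vertical axis: y with x1 controlled for
rx2 = residuals(x1, x2)  # horizontal axis: x2 with x1 controlled for
_, partial_slope = simple_fit(rx2, ry)
print(partial_slope)
```

Plotting `ry` against `rx2` gives the partial regression plot. The slope through those points matches the coefficient `x2` would receive in the full model, which is 3 for this constructed data.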
This post is meant as a short tutorial on how to set up PySpark to access a MySQL database and run a quick machine learning algorithm with it. In this example, both PySpark and MySQL are installed locally on a computer running Kubuntu 20.04, so this can be done without any external resources.
Benford’s law is the tendency for small digits to be more common than large ones when looking at the first non-zero digits in a large, heterogeneous collection of numbers. These frequencies range from about 30% for a leading 1 down to about 4.6% for a leading 9, as opposed to the constant 11.1% you would get if they all appeared at the same rate.
Since I recently wrote about unpacking the pages from a dump of the English Wikipedia, I thought I would see if Benford’s law manifested in the text of Wikipedia, as it seems to fit the idea of a “large, heterogeneous collection of numbers” quite well.
The notebook containing the full code is here.
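For reference, the expected frequency of leading digit d under Benford’s law is log10(1 + 1/d). The sketch below computes those frequencies and tallies leading digits for a small stand-in dataset. The powers of 2 are only an illustrative example, not the Wikipedia data, and the helper names are made up.

```python
import math
from collections import Counter

def benford_expected():
    """Expected frequency of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First non-zero digit of a non-zero number."""
    for ch in str(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("number has no non-zero digit")

expected = benford_expected()  # expected[1] is about 0.301, expected[9] about 0.046

# Stand-in data for illustration only: the first 100 powers of 2.
sample = [2 ** k for k in range(1, 101)]
observed = Counter(leading_digit(x) for x in sample)

for d in range(1, 10):
    print(d, f"expected {expected[d]:.3f}", f"observed {observed[d] / len(sample):.3f}")
```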
The t-test is a common, reliable way to check for differences between two samples. When dealing with multivariate data, one can simply run t-tests on each variable and see if there are differences. However, this can lead to scenarios where the individual t-tests suggest that there is no difference, even though looking at all variables jointly shows one. When a multivariate test is preferred, the obvious choice is Hotelling’s \(T^2\) test.
Hotelling’s test has the same overall flexibility that the t-test does, in that it can also work on paired data, or even a single dataset, though this example will only cover the two-sample case.
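For reference, the standard textbook form of the two-sample statistic is given below, assuming the two groups share a common covariance matrix. The symbols are introduced here for illustration: \(n_1, n_2\) are the sample sizes, \(p\) is the number of variables, \(\bar{x}_1, \bar{x}_2\) are the sample mean vectors, and \(S_1, S_2\) are the sample covariance matrices.

\[
S_p = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2},
\qquad
T^2 = \frac{n_1 n_2}{n_1 + n_2} \, (\bar{x}_1 - \bar{x}_2)^\top S_p^{-1} (\bar{x}_1 - \bar{x}_2)
\]

\[
\frac{n_1 + n_2 - p - 1}{p \, (n_1 + n_2 - 2)} \, T^2 \sim F_{p,\; n_1 + n_2 - p - 1}
\]

With \(p = 1\), \(T^2\) reduces to the square of the usual pooled-variance two-sample t statistic.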