Recent comments in /f/dataisbeautiful

terrykrohe OP t1_j27gjbq wrote

... the missing persons data for the fifty states looks random: the t-test value of 0.96 indicates that the difference between the means of the Rep and Dem states can be attributed to random fluctuations

... when this data is compared with data for GDP, suicide, life expectancy, etc. (see post 14Apr), the contrast is impressive; and the wonder of it all is "why is there a Rep and Dem difference in the data?" (note: Rep states are always on the negative side of the comparison: lower GDP, more obesity, higher infant mortality, etc.)

1

MrBookman_LibraryCop OP t1_j27g8za wrote

Inspired by this post about the recent World Cup I thought I'd have a look at something similar for the Grand Slam tennis tournaments. The result is the graph above.

The graph shows the average return per match you'd get if you were to place a $1 bet on each match in a round, and if you did that consistently on either the underdog or the favourite. It's based on data for 2007 onwards, noting that I've aggregated quarterfinals, semifinals and finals because of the low number of observations you'd get otherwise.

The data is from http://www.tennis-data.co.uk/ and the plot is made using R (ggplot). The original datasets contain odds from numerous betting firms that differ by tournament and year, so I've taken the mean odds across whatever was available for each match.
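For anyone who wants to reproduce the idea, here is a minimal sketch of the return calculation in Python (the plot itself was made in R). It assumes decimal odds that have already been averaged across bookmakers; the rows, the tuple layout and the function name `average_return` are made up for illustration and are not from tennis-data.co.uk.

```python
from statistics import mean

# Hypothetical rows: (favourite_odds, underdog_odds, favourite_won).
# Decimal odds, assumed to be already averaged across bookmakers.
# These numbers are invented for illustration only.
matches = [
    (1.25, 4.00, True),
    (1.50, 2.60, False),
    (1.10, 7.00, True),
    (1.80, 2.00, True),
]


def average_return(matches, bet_on_favourite):
    """Mean net profit per match from staking $1 on the same side every time."""
    profits = []
    for fav_odds, dog_odds, fav_won in matches:
        odds = fav_odds if bet_on_favourite else dog_odds
        won = fav_won if bet_on_favourite else not fav_won
        # A winning $1 bet at decimal odds o nets o - 1; a losing bet nets -1.
        profits.append(odds - 1.0 if won else -1.0)
    return mean(profits)


favourite_return = average_return(matches, bet_on_favourite=True)
underdog_return = average_return(matches, bet_on_favourite=False)
print(favourite_return, underdog_return)
```

Grouping the rows by tournament and round before calling the function would give one value per bar in a chart like this one.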

So, should you bet on the favourite or the underdog? Well, neither really, unless we're talking about the men's quarterfinals and beyond at Roland Garros, the men's fourth round at the US Open, the women's final at the AO and Wimbledon, and the women's fourth round at Roland Garros and the US Open.

This should not be taken as financial advice or sound strategy in any way, shape or form, so my last advice is just to watch and enjoy the matches without stressing over losing money!

4

terrykrohe OP t1_j27a6ot wrote

other comments for missing persons VS 'rural-urban'
i) The missing persons metric is the only metric which can be described as "random":
thus, it provides contrast for non-random metrics
– the non-random character of other metrics is emphasized when visualized against the missing persons visual
ii) the Alaska outlier point is curious: probably due to boating and winter incidents for which no bodies are found.

2

terrykrohe OP t1_j279xgo wrote

sources
– missing persons https://namus.nij.ojp.gov
The National Missing and Unidentified Persons System (NamUs), US Census Bureau 2020 Population Data
– population density https://www.states101.com/populations (2014 population estimates)
– agriculture income https://data.ers.usda.gov/reports.aspx?ID=17839#P9dd070795569412d9525def18d45bde2_4_185iT0R0x0

method for "rural-urban" metric
– population density and agriculture income data values were converted to "standard scores", aka "z-scores": z-score = (data value - mean)/SD (see Wikipedia, "Standard score")
– the z-scores were added and divided by 2; result = the 'rural-urban' metric z-score
– note1: 'urban' means "increasing population density"
'rural' means "increasing agriculture income as % of state GDP"
for the 'rural' metric to denote a "rural to urban" value,
the z-scores for agriculture income were 'reversed' by multiplying by "-1"
before adding to the population density z-scores
– note2: "NCE" is "normal curve equivalent" (see Wikipedia, "Normal curve equivalent")
tool: Mathematica
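
The method above can be sketched in a few lines of Python (the original work was done in Mathematica). This is only an illustration: the four data values are invented, the variable names are hypothetical, and using the sample standard deviation (n - 1) is an assumption, since the post does not say which SD was used.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standard scores: (data value - mean) / SD.

    Assumption: sample SD (n - 1 denominator); the post does not specify.
    """
    m = mean(values)
    sd = stdev(values)
    return [(v - m) / sd for v in values]


# Invented example values for four hypothetical states.
population_density = [10.0, 50.0, 200.0, 1000.0]  # people per square mile
agriculture_share = [8.0, 4.0, 1.0, 0.5]          # agriculture income, % of GDP

z_density = z_scores(population_density)
# Reverse the agriculture z-scores (multiply by -1) so that larger = more urban.
z_agriculture_reversed = [-z for z in z_scores(agriculture_share)]

# Add the two z-scores and divide by 2 to get the 'rural-urban' metric.
rural_urban = [(d + a) / 2 for d, a in zip(z_density, z_agriculture_reversed)]
print(rural_urban)
```

With this sign convention, negative values fall on the rural side and positive values on the urban side.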

***************

top two plots
Missing persons and 'rural-urban' metrics: note that missing persons t-test indicates that data fluctuations are probably "random" in character.
The large difference between the 'rural-urban' means (> 1 SD) for Rep and Dem states indicates that Rep and Dem states are different Sample populations.
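
As a rough illustration of comparing two group means, here is a sketch of a two-sample t statistic in Python. The post does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the two lists are invented z-score values, not the actual state data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference of means over its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented 'rural-urban' z-scores for two hypothetical groups of states.
rep_states = [-0.9, -0.4, -1.2, -0.6, -0.1]
dem_states = [0.8, 1.1, 0.3, 0.9, 0.5]

t_stat = welch_t(rep_states, dem_states)
print(t_stat)
```

A statistic far from zero suggests the two groups have different means; turning it into a p-value additionally needs the t distribution, which is left out of this sketch.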

the bottom plot
– Missing persons VS 'rural-urban' predictor metric: the r-value of -0.11 indicates that the data is essentially "noise" about the best-fit line.
– Note that purple is used for best-fit line, mean, and SD because the Rep and Dem states data are NOT different Sample populations.
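
For reference, the r-value is the Pearson correlation coefficient, which can be computed as in the Python sketch below. The five data points are invented for illustration and the variable names are hypothetical; they are not the NamUs figures.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads, in [-1, 1]."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values: 'rural-urban' z-scores and a missing persons rate.
rural_urban = [-1.5, -0.5, 0.0, 0.5, 1.5]
missing_persons = [6.0, 9.0, 5.0, 8.0, 6.5]

r = pearson_r(rural_urban, missing_persons)
print(r)
```

Squaring r gives the share of variance accounted for by the best-fit line; for r = -0.11 that is about 0.012, or roughly 1%.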

1

WaterScienceProf OP t1_j2782pb wrote

Water in the atmosphere is a near infinite resource. It stays up on average only 8-10 days, being continuously regenerated by the sun. I don’t mean to be dismissive, but the amount of water in the atmosphere dwarfs currently used freshwater sources by orders of magnitude. And unlike other methods, it doesn’t produce wastestreams, which can be ecologically damaging especially if said wastewater is salty and far from an ocean.

When we pump in dirty water for things besides drinking, it’s called greywater reuse, and is actually far more widespread than AWH.

The real concerns for AWH are around its energy intensity, which is many times that of conventional sources. As a result, it’s likely not economically viable for uses beyond ultra-pure water. And if it’s not powered by renewables, it may not be sustainable, while renewable power is itself resource-intensive to create. You are right to criticize it, but you focused on the wrong issue!

Sources: https://hess.copernicus.org/articles/21/779/2017/hess-21-779-2017.html https://greywateraction.org/greywater-reuse/

1

Dear_Spring7657 t1_j271ax7 wrote

What do the splits represent and why is it ordered in this way?

Typically, this sort of diagram is used to go from least specific to most specific left to right, with the ones on the right being dependent on the ones on the left.

The slices in this diagram seem to be pretty much completely independent of each other (they merge). I think this data would be much easier to digest as a series of pie charts.

3