Recent comments in /f/dataisbeautiful

michaelGaryScarn009 OP t1_j6fimqx wrote

What

A well-researched Reddit post identified a statistical anomaly in the home vs away steals and blocks for DPOY favorite Jaren Jackson Jr. (JJJ). (link)

Why

It's several things I like (basketball, variance, Reddit, etc.) wrapped up in an opportunity to make a data viz.

Graph

- Per 36 min steals + blocks (stocks)

- Size of point = absolute difference of home and away average
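
A minimal sketch of how those two quantities could be computed, assuming season totals of steals, blocks, and minutes for each split. The function names and the sample numbers are made up for illustration; this is not the code behind the chart, and the chart may average per-game rates rather than divide totals.

```python
def per36_stocks(steals, blocks, minutes):
    """Steals plus blocks ("stocks"), scaled to a 36-minute rate."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (steals + blocks) * 36.0 / minutes


def home_away_gap(home, away):
    """Absolute difference between home and away per-36 stocks.

    Each argument is a (steals, blocks, minutes) tuple of totals.
    """
    return abs(per36_stocks(*home) - per36_stocks(*away))


# Hypothetical totals for an invented player, not real NBA data.
home_totals = (20, 60, 480)
away_totals = (15, 30, 540)
gap = home_away_gap(home_totals, away_totals)
print(f"home/away gap in per-36 stocks: {gap:.2f}")
```

In a plot like the one described, this gap would drive the size of each player's point.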

Conclusion

Just looking at the viz, yeah, this is exceptional behavior. Others have estimated a sub-1% chance that this variation is random. The NBA has clarified that an NBA representative, not the local team, makes the official stat determinations. NBA media have reviewed the calls for JJJ and found only a few questionable ones. But we're just looking at data.

Data

- From NBA.com on 01-29-2023

- Top 50 players in per 36 min steals and blocks

-- Home and away

-- At least 8 games played for home and away

1

qwerty6731 t1_j6f35mb wrote

The double use of the same colours is confusing. Also, the legend for the lines is too small and out of the way.

Suggestion: Keep the three colours for the players, enlarge the legend and bring it closer to the lines, then change the bottom win distributions to monochrome shades by tournament, based on the colour associated with each player.

12

PartisanPlayground OP t1_j6eo5hz wrote

You're hitting on the most subjective part of this whole process. I've run into all of the issues you describe, and the question is ultimately: how do you define a story?

Your GOP primaries example is a good one. Let's say we have articles on Trump's legal issues, other articles on Pence's classified documents, and other articles on DeSantis and books. Now let's say all of these articles describe these things in the context of the 2024 GOP primaries. Is this one story called "GOP primaries"? Or three separate stories? You could make a case either way.

I've tuned the algorithm to split stories in a way that "looks about right" to me. That's subjective, but there's no way around it. This is an issue whether you're using an algorithm or doing this manually.

A related challenge is that story definitions may change over time. The classified documents story is a good example for this. Right now there are articles on Trump, Biden, and Pence all mishandling classified documents. The algorithm is categorizing all of them as the same story (fair enough).

But let's say that next week (just making this up), Trump gets indicted for it. Is that a separate story now? If so, how do you treat that? Do you retroactively split out the "Trump" portion of the "classified documents" story as though they were not the same story before? Do you show the classified documents story splitting into two? Do you just create a new story on the day the indictment happens? Currently, the algorithm is set up to do the first of these, but again, you could make a case for any of them.

All of this is to say that there is subjectivity involved in this process.

1

ghostfaceschiller t1_j6ej1pw wrote

I recently put together a repo of character frequency analyses, bc they can be really useful when designing keyboard layouts. So I have an eye out rn for interesting ways to look at and visualize the data. I think this particular instance is probably too limited to be useful for keyboard layouts, but if you do anything more please let me know! It’s one of the more interesting visualizations I’ve seen so I’d love to include/link it

https://github.com/dschil138/word-and-character-frequencies

2

kilopeter OP t1_j6ecjw9 wrote

I haven't, but good point. The code to count transitions between characters is very straightforward (well... I wrote mine without worrying about performance issues), and in principle could be packaged as a lightweight web app or even a JavaScript-powered static site and accept any text corpus uploaded or linked by the user.
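
For a sense of how short that kind of counting can be, here is a minimal Python sketch. It is an illustration only, not the code behind the visualization; the function name is made up, and it assumes the whole text fits in memory and that every character (including spaces and punctuation) counts.

```python
from collections import Counter


def count_transitions(text):
    """Count how often each character is immediately followed by another.

    Returns a Counter keyed by (current_char, next_char) pairs.
    """
    # zip pairs each character with its successor; the last character has none.
    return Counter(zip(text, text[1:]))


counts = count_transitions("banana")
for (first, second), n in sorted(counts.items()):
    print(f"{first!r} -> {second!r}: {n}")
```

A real corpus would probably need decisions this sketch skips, such as case folding or dropping whitespace.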

2