myrianthe t1_j6bd1x9 wrote on January 29, 2023 at 2:54 AM

Reply to comment by coodgee33 in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

Yup. U is the most common letter to come after Q in general, not just in names. Interestingly, there isn't really another vowel or consonant that works after Q (Qa, Qe, Qi, Qo, Qr, Ql?).

Some of the more popular Q names include Quinn, Quincy, Queen/Queenie, and Quintessa.

myrianthe t1_j6bcix9 wrote on January 29, 2023 at 2:50 AM

Reply to comment by rug1998 in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

U is the most common letter to come after Q in general, not just in names. Interestingly, there isn't really another vowel or consonant that works after Q (Qa, Qe, Qi, Qo, Qr, Ql?).

Some of the more popular Q names include Quinn, Quincy, Queen/Queenie, and Quintessa.

rug1998 t1_j6bc16j wrote on January 29, 2023 at 2:46 AM

Reply to comment by kilopeter in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

Bro… ok got it

kilopeter OP t1_j6bbmom wrote on January 29, 2023 at 2:43 AM

Reply to comment by mikeholczer in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

It does if you include the placeholder "characters" for the start and end of each name! The most probable "name" A represents three tokens: [name start], A, [name end]. And if you generate many names using the transition matrix, you will indeed observe that the frequency of [name start] -> A and A -> [name end] matches the corresponding frequencies in the source data.

EDIT: on reflection, I agree with you. I should introduce the heatmap as a description of transition probabilities, but should avoid walking the reader through using the transition matrix to generate new "names." I should separate the topic of generating new names using the transition matrix under the (invalid) Markov assumption as a diversion. Thanks for pointing out the flaw in my explanation. I'll edit my top level comment when I have a chance!
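For anyone who wants to try the generation step described above, here is a minimal sketch, assuming a first-order Markov chain over characters with placeholder start and end tokens. The token symbols, the function names, and the toy name list are made up for illustration; this is not the code behind the chart.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical markers for [name start] and [name end]

def bigram_counts(names):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for name in names:
        tokens = [START, *name.lower(), END]
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return counts

def generate(counts, rng, max_len=30):
    """Walk the chain from START until END is drawn (or max_len is hit)."""
    out, cur = [], START
    while len(out) < max_len:
        followers = counts[cur]
        # Sample the next token in proportion to its observed frequency.
        cur = rng.choices(list(followers), weights=list(followers.values()))[0]
        if cur == END:
            break
        out.append(cur)
    return "".join(out).capitalize()

# Toy input, not the 2021 dataset.
toy_names = ["Ava", "Amelia", "Mia", "Emma"]
counts = bigram_counts(toy_names)
rng = random.Random(0)
samples = [generate(counts, rng) for _ in range(5)]
print(samples)
```

Because each step looks only at the current character, very short outputs such as a single letter can come up often, which is the concern raised in this thread.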

tomiwa1a t1_j6bbj7o wrote on January 29, 2023 at 2:42 AM

Reply to comment by insane9001 in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

The Y Axis is the number of books. I agree with you, though: that was an oversight on our part. I also don't like when graphs don't have a labelled Y-Axis. Next time we'll add them.

tomiwa1a t1_j6bbdzn wrote on January 29, 2023 at 2:41 AM

Reply to comment by EICONTRACT in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

Watch the demo. Youtube doesn't give matches this precise.

tomiwa1a t1_j6bb9zk wrote on January 29, 2023 at 2:40 AM

Reply to comment by Thenerdy9 in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

You can try it here: https://atlas.atila.ca/

tomiwa1a t1_j6bb8e8 wrote on January 29, 2023 at 2:40 AM

Reply to comment by Chramir in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

Exactly! This is how it works.

I agree it's not perfect, but remember, Youtube itself is not a library, so any comparison to real libraries will require some degree of approximation. You can think of it as a rough estimate or, my preferred term, a Fermi Estimate.

kilopeter OP t1_j6bb7id wrote on January 29, 2023 at 2:40 AM

Reply to comment by kismatwalla in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

The fact that names are not memoryless is exactly why treating them as such produces such entertaining results :)

mikeholczer t1_j6bb50x wrote on January 29, 2023 at 2:39 AM

Reply to comment by kilopeter in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

If one follows your steps, the most common outcome is one letter, and it has no between-letter patterns, which clearly doesn’t match the between-letter patterns of the source data.

kismatwalla t1_j6bb15t wrote on January 29, 2023 at 2:38 AM

Reply to Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

Names are not memoryless, so transition probability does not make sense.

tomiwa1a t1_j6bapiz wrote on January 29, 2023 at 2:36 AM

Reply to comment by Purplekeyboard in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

That happens because unless someone has previously submitted a youtube video with "I gotta have more cowbell", we won't have it in our index.

>The transcripts get added on-demand when users request to search for a video. It wouldn't make sense to index the entire database given its large size. We're also able to get the transcripts pretty quickly, so there's no need to pre-cache the transcripts if a user has never asked for it before. A more detailed overview of how it works can be found here:

See: earlier comment

kilopeter OP t1_j6banxn wrote on January 29, 2023 at 2:36 AM

Reply to comment by mikeholczer in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

Oh? What part? I specifically qualified my interpretation with "want to reflect typical between-letter patterns of US girl names."

That's the point of using this viz to generate new names: generating character strings with totally realistic letter-to-letter transition probabilities is not enough to yield plausible names, or names which already exist. The generated names are often bizarre or excessively long, yet their character transition probabilities exactly reflect those of the real names in the input dataset.

tomiwa1a t1_j6baiw7 wrote on January 29, 2023 at 2:34 AM

Reply to comment by Ruleyoumind in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

Yup, here! https://atlas.atila.ca/

tomiwa1a t1_j6bagzz wrote on January 29, 2023 at 2:34 AM

Reply to comment by MurdrWeaponRocketBra in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

Thanks! The transcripts get added on-demand when users request to search for a video. It wouldn't make sense to index the entire database given its large size. We're also able to get the transcripts pretty quickly, so there's no need to pre-cache the transcripts if a user has never asked for it before.
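As a rough illustration of the on-demand idea, here is a minimal sketch, assuming transcripts are fetched and stored only when a video is first requested. The class name, the `fetch_transcript` callable, the video IDs, and the plain substring match are all hypothetical stand-ins, not the actual Atlas implementation.

```python
from typing import Callable, Dict, List

class LazyTranscriptIndex:
    """Stores a transcript only after someone asks for that video."""

    def __init__(self, fetch_transcript: Callable[[str], str]):
        self._fetch = fetch_transcript          # hypothetical transcript fetcher
        self._transcripts: Dict[str, str] = {}  # video_id -> transcript text

    def add_video(self, video_id: str) -> None:
        # Fetch once, on first request; later requests reuse the stored copy.
        if video_id not in self._transcripts:
            self._transcripts[video_id] = self._fetch(video_id)

    def search(self, phrase: str) -> List[str]:
        # Plain substring match as a stand-in for the real search.
        needle = phrase.lower()
        return [vid for vid, text in self._transcripts.items()
                if needle in text.lower()]

def fake_fetch(video_id: str) -> str:
    # Made-up transcripts for the example.
    return {"vid1": "I gotta have more cowbell", "vid2": "hello world"}[video_id]

index = LazyTranscriptIndex(fake_fetch)
print(index.search("more cowbell"))  # [] because nothing has been requested yet
index.add_video("vid1")
print(index.search("more cowbell"))  # ['vid1']
```

The first search comes back empty for the same reason a phrase can be missing from the index: nobody has submitted that video yet.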

A more detailed overview of how it works can be found here:

tomiwa1a t1_j6ba59m wrote on January 29, 2023 at 2:31 AM

Reply to comment by ZeusTheRecluse in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

The other interesting piece is that the Library of Congress was founded in 1800 (though a fire caused it to restart its collection in 1815).

Youtube was founded in 2005.

So in just 17 years, Youtube has amassed a collection of information that is 57% the size of the world's largest library, which has been accumulating its collection for over 200 years.

I'm also Canadian. Hadn't heard of it either until we did this report. We probably haven't heard of it because we likely won't need to use any of its resources. Public libraries already do a really good job for most of our day to day needs.

Wikipedia's small size makes sense given that contributions are heavily restricted and have such a high bar. Imagine if every Youtube video had to be approved by editors before uploading, or every author had to have their books approved by editors before publishing.

mikeholczer t1_j6b9was wrote on January 29, 2023 at 2:29 AM

Reply to comment by kilopeter in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

Yeah, I think the display of the data is interesting, I just think what you wrote about it is misleading.

kilopeter OP t1_j6b9oys wrote on January 29, 2023 at 2:28 AM

Reply to comment by rug1998 in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

That 95 means that 95% of the time, a Q was followed by a U in a name.
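To make the reading of a single cell concrete, here is a small sketch, assuming each name counts once (the chart may weight names by how many babies received them; that detail isn't stated here). The toy list and the function name are made up for illustration.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical markers for name start and end

def transition_percentages(names):
    """Return {current: {next: percent}} over successive characters."""
    counts = defaultdict(Counter)
    for name in names:
        tokens = [START, *name.upper(), END]
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: 100 * n / sum(followers.values())
              for nxt, n in followers.items()}
        for cur, followers in counts.items()
    }

# Toy input, not the 2021 dataset.
toy_names = ["Quinn", "Quincy", "Queenie", "Aqsa"]
table = transition_percentages(toy_names)
print(table["Q"])  # {'U': 75.0, 'S': 25.0}
```

In this toy list, Q is followed by U in three of its four occurrences, so the Q row shows 75 under U; the 95 in the chart is read the same way.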

kilopeter OP t1_j6b9h0h wrote on January 29, 2023 at 2:26 AM

Reply to comment by mikeholczer in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

Oh, absolutely: the fact that this Markov assumption yields nonsensical names shows that the sequence of letters in given names is not generated by a Markov process. (The next character depends very much on previous characters, not just the current one.)

But this visualization does accurately present the relative frequencies of character transitions in actual names. Using these frequencies to generate Markov chains of characters and calling the results names is a fun diversion whose results I found entertaining.

tomiwa1a t1_j6b8wnz wrote on January 29, 2023 at 2:22 AM

Reply to comment by worriedshuffle in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

Can you please clarify? What do you mean by it isn't clear how books on Youtube is calculated?

If you check this range you can see how we arrived at our numbers:

We calculated the number of hours of video uploaded to Youtube every minute from 2007-2022 (source: statista)
We found how many words are spoken per hour of human conversation (source: virtualspeech)
We calculated the number of words in the average book (source: jericho writers)

Then we did some calculations with those numbers to arrive at 99,338,400 books on Youtube
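The shape of that calculation can be sketched as below. The three inputs are assumed round numbers chosen for illustration (1 billion hours from the post title, plus a guessed speaking rate and book length); they are not the figures from the spreadsheet, so the result only lands in the same ballpark as the number quoted above.

```python
def books_equivalent(video_hours, words_per_hour, words_per_book):
    """Fermi estimate: total spoken words divided by words per book."""
    return video_hours * words_per_hour // words_per_book

# Assumed inputs, NOT the spreadsheet's values:
video_hours = 1_000_000_000  # "over 1 billion hours" from the post title
words_per_hour = 9_000       # assumption: roughly 150 words per minute
words_per_book = 90_000      # assumption: a typical book length

estimate = books_equivalent(video_hours, words_per_hour, words_per_book)
print(f"{estimate:,}")  # 100,000,000
```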

tomiwa1a t1_j6b8gcr wrote on January 29, 2023 at 2:18 AM

Reply to comment by NovaticFlame in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

Y Axis is the number of books. You're right though, the Y Axis should definitely have been there.

You can see the details of those calculations here: https://docs.google.com/spreadsheets/d/1UbekWhTLJKQj6ZLipg1R269CQ8g0ACDbzPRDFN14inc/edit#gid=52223737

Context for the Y-Axis

rug1998 t1_j6b80wv wrote on January 29, 2023 at 2:15 AM

Reply to Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

95 names with Qu? I can’t think of one

tomiwa1a t1_j6b80iu wrote on January 29, 2023 at 2:15 AM

Reply to comment by DenL4242 in [OC] Youtube has over 1 billion hours of videos, we Built an AI Search Engine that can find exact timestamps for anything on Youtube by simonezchen

I don't think it's fair to say that comparing Youtube to a Library is like comparing Mt. Everest to a Cow. For one thing, there is actually a pretty clever way to estimate the amount of text on Youtube and compare it to the amount of text in a library.

Maybe, if I explain how we made the graph you'll see that it's more apples to apples than mountains to cows:

We calculated the number of hours of video uploaded to Youtube every minute from 2007-2022 (source: statista)
We found how many words are spoken per hour of human conversation (source: virtualspeech)
We calculated the number of words in the average book (source: jericho writers)

Then we did some calculations with those numbers to arrive at 99,338,400 books on Youtube

You can see the details of those calculations here: https://docs.google.com/spreadsheets/d/1UbekWhTLJKQj6ZLipg1R269CQ8g0ACDbzPRDFN14inc/edit#gid=52223737

snerp t1_j6b7yj8 wrote on January 29, 2023 at 2:15 AM

Reply to comment by wagonmaker85 in [OC] How news stories evolve in the news cycle by PartisanPlayground

Can't argue with that, good take.

mikeholczer t1_j6b7fim wrote on January 29, 2023 at 2:10 AM

Reply to comment by kilopeter in Transition probabilities (shown as percentages) between successive letters in the names of girls born in 2021 in the USA [OC] by kilopeter

My point is your interpretation is flawed, because the most likely outcome of it is very far from the actual most likely name.

Recent comments in /f/dataisbeautiful