Recent comments in /f/deeplearning

beingsubmitted t1_iqvs8ix wrote

All you need to know, over time, is the pitch being played, which is a frequency. The audio file represents a waveform, and all you need from that waveform is its frequency over time. There's no need for anything sequential: 440 Hz is "A" no matter where it comes in a sequence. It's A if it comes after C, and it's A if it comes after F#.

A sequential model might be useful for natural language, for example, because meaning is carried between words. "Very tall" and "Not Tall" are different things. "He was quite tall" and "No one ever accused him of not being tall" are remarkably similar things. Transcribing music is just charting the frequency over time.

That said, you cannot get the frequency from a single data point, so there is a somewhat sequential nature to things, but it really just means you need to transform a stretch of waveform samples into frequencies, which is what the Fourier transform does. When music visualizations show you an EQ (equalizer) chart to go with your music, this is what they're doing: using an FFT to show how much of each frequency is present at a given moment in the music. A digital equalizer similarly transforms the audio into a frequency spectrum, lets you adjust that spectrum, and then transforms it back into a waveform.
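
For a concrete illustration, here's a minimal sketch of that idea in Python (assuming NumPy is available and using a synthetic 440 Hz sine wave in place of a real audio file):

```python
import numpy as np

sample_rate = 44100                          # samples per second
t = np.arange(0, 0.1, 1 / sample_rate)       # a 0.1 s analysis window
signal = np.sin(2 * np.pi * 440 * t)         # pretend recording of an "A" (440 Hz)

spectrum = np.abs(np.fft.rfft(signal))                 # magnitude of each frequency
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)  # frequency of each FFT bin
dominant = freqs[np.argmax(spectrum)]                  # strongest frequency present

print(f"Dominant frequency: {dominant:.1f} Hz")        # ~440 Hz -> the note "A"
```

That magnitude spectrum is exactly what an EQ visualization plots, just recomputed for each short slice of the song.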

2

camaradorjk OP t1_iqvje9f wrote

Thank you so much for taking the time to answer my question. You're right on the first one: my goal is to transcribe music, specifically flute music, into notes. But I'm a little confused about why there's no need for a deep learning model, because I initially thought I could also use sequential models. Could you elaborate on that for me? Thank you so much.

PS: I will surely look into your recommendation about FFT.

1

beingsubmitted t1_iqvdo7x wrote

I'm a little unclear - there are three different things you might be trying to do here. The first would be transcription: taking an audio file and interpreting it into notes. That wouldn't typically require deep learning on its own, just a Fourier transform. The second would be isolating a specific instrument in an ensemble - finding just the recorder in a collection of different instruments all playing different things. The third would be generation: inferring unplayed future notes based on previous notes.

Are you wanting to transcribe, isolate, generate, or some combination?

I'm thinking you want to transcribe. If that's the case, FFT (fast Fourier transform) would be the algorithm to choose. If you google "FFT music transcription" you'll get a lot of info. https://ryan-mah.com/files/posts.amt_part_2.main.pdf
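
In case it helps, here's a rough sketch of what that frequency-over-time charting looks like (assumptions: NumPy only, monophonic audio, and a synthetic two-note signal standing in for a real flute recording):

```python
import numpy as np

sample_rate = 22050
frame_len = 2048                             # ~93 ms analysis frames

# Hypothetical input: 0.5 s of A4 (440 Hz) followed by 0.5 s of C5 (~523.25 Hz).
t = np.arange(int(0.5 * sample_rate)) / sample_rate
audio = np.concatenate([np.sin(2 * np.pi * 440.00 * t),
                        np.sin(2 * np.pi * 523.25 * t)])

note_names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def to_note(freq_hz):
    # Map a frequency to the nearest equal-tempered note (A4 = 440 Hz = MIDI 69).
    midi = int(round(69 + 12 * np.log2(freq_hz / 440.0)))
    return note_names[midi % 12] + str(midi // 12 - 1)

# Chart the dominant frequency (and hence the note) over time, frame by frame.
for start in range(0, len(audio) - frame_len, frame_len):
    frame = audio[start:start + frame_len] * np.hanning(frame_len)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(frame_len, 1 / sample_rate)
    peak = freqs[np.argmax(spectrum)]
    print(f"{start / sample_rate:5.2f}s  {peak:7.1f} Hz  ->  {to_note(peak)}")
```

A real transcriber would add onset detection and smarter pitch estimation, but the core is the same frequency-over-time chart.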

3

sydjashim t1_ique162 wrote

I have a quick guess here that might be of help: take the trained weights of the first n-1 layers of your first model, then try finetuning with the 4 outputs and observe whether your validation loss improves.

If so, you can then take the untrained initial weights of your first model (up to the (n-1)th layer) and train them to convergence with 4 outputs. The point of this step is that you get a model trained from scratch for 4 outputs, but starting from the same initial weights as the first model.

Why am I saying this?

Well, I think you could try it this way since you want to keep as many parameters as possible, especially the model weights, the same while running the comparison between the two.
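
If it helps, here's a rough PyTorch sketch of what I mean (the layer sizes, the 3-vs-4 output setup, and `old_model` are all hypothetical stand-ins for OP's actual networks):

```python
import copy
import torch
import torch.nn as nn

# Stand-in for the first model; imagine it has already been trained with 3 outputs.
old_model = nn.Sequential(
    nn.Linear(30, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),                      # original output layer
)
initial_state = copy.deepcopy(old_model.state_dict())   # save the untrained weights up front

# Step 1: reuse the trained weights of everything but the last layer,
# attach a fresh 4-output head, finetune, and watch the validation loss.
finetune_model = nn.Sequential(
    *list(old_model.children())[:-1],      # shared, pretrained n-1 layers
    nn.Linear(64, 4),                      # new output layer
)
optimizer = torch.optim.Adam(finetune_model.parameters(), lr=1e-3)

# Step 2: build the same 4-output architecture from scratch, but load the
# *untrained* initial weights of the first model into its shared layers,
# so both runs start from identical initial weights.
scratch_model = nn.Sequential(
    nn.Linear(30, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
shared = {k: v for k, v in initial_state.items()
          if not k.startswith("4.")}       # drop the old 3-output head ("4" is its index here)
scratch_model.load_state_dict(shared, strict=False)
```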

2

Chigaijin t1_iqu05wy wrote

Razer makes them with input from Lambda Labs, a deep learning company that hosts cloud GPUs and supports onsite builds as well. Lambda provides support and has been very helpful the few times I've needed to reach them. The base model (Linux) is $3.5k with a one-year warranty, $4.1k for the same with a two-year warranty, and $5k for a dual Linux/Windows machine with a three-year warranty. All machines have the same specs, so it's really the support/warranty you're paying for.

2

thebear96 t1_iqtxrnx wrote

Well, I assumed that the network had more layers and so more parameters. More parameters can represent the data better and converge quicker. For example, if you had a dataset with 30 features and used a linear layer with 64 neurons, it should be able to represent each data point more quickly and easily than, say, a linear layer with 16 neurons. That's why I think that model would converge quicker. But in OP's case the hidden layers are the same; only the output layer has more neurons. In that case we won't get quick convergence.
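
For what it's worth, a quick way to see the parameter-count difference (hypothetical sizes, assuming PyTorch):

```python
import torch.nn as nn

wide = nn.Linear(30, 64)     # 30*64 weights + 64 biases = 1984 parameters
narrow = nn.Linear(30, 16)   # 30*16 weights + 16 biases =  496 parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(wide), count(narrow))   # 1984 496
```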

1

PleaseKillMeNowOkay OP t1_iqscxo9 wrote

The simpler model had lower training loss with the same number of epochs. I tried training the second model until it reached the same training loss as the first model, which took much longer. The validation loss did not improve and had a slight upward trend, which I know means it's overfitting.

1

Best_Definition_4385 t1_iqr6aca wrote

>The Tensorbook is only $3500 unless you're looking at the dual boot model

This is not a comment about you, it's just a general comment. Considering that setting up a dual boot system takes minimal time and expertise, if someone decides to spend an extra $500 for the dual boot model, do you really think they have the computer skills to need a powerful laptop? I mean, they can't figure out how to dual boot, but they want to do deep learning? lmao

3

thebear96 t1_iqr04o9 wrote

Ideally it should. In that case you'll see worse performance from the second architecture, and you'll have to say that when you compare. But it's pretty much expected that the second architecture will not perform as well as the first one, so I'm not sure there's much use in comparing. It's definitely doable, though.

2