rajatarya
Submitted by rajatarya t3_10a4mns in MachineLearning
rajatarya OP t1_j0jx9cw wrote
Reply to comment by _matterny_ in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
No specific file limit. Because we scan the files and chunk them, the number of files in the repo doesn't matter. For each file in the repo, though, we leave a pointer file that references the Merkle tree entry for that file.
rajatarya OP t1_j0ij9if wrote
Reply to comment by ZorbaTHut in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Would love to talk more. DM me your work email and I will follow up to set up time. We have heard of this use case, and some of us (myself included) used Perforce ~20 years ago.
One thing I would love to learn more about is the expected overall workflow. Meaning, what do game development teams expect their workflow to be? How does XetHub (or any other tool for code & asset management) fit into that workflow?
rajatarya OP t1_j0iemk4 wrote
Reply to comment by BossOfTheGame in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
There isn't a hard limit at 1TB currently. The main thing is that the experience / performance may degrade. The size of the Merkle tree is roughly 1% of total repo size (about 10GB for a 1TB repo), so at 1TB even downloading that can take some time. You can definitely use XetHub with repos larger than 1TB today - but your mileage may vary (in terms of perf/experience).
To avoid downloading the entire repo you can use Xet Mount today to get a read-only file system view of the repo. Or use the `--no-smudge` flag on clone to simply get pointer files. Then call `git xet checkout` for the files you want to hydrate.
I would love to talk more about the 2TB DVC repos you are using today - and believe they would be well served by XetHub. Something I would be eager to explore. DM me your email if interested and I will follow up.
Thanks for the question!
rajatarya OP t1_j0h8m7z wrote
Reply to comment by jakethesnake_ in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Great, can't wait to hear your feedback once you've gotten back to work in the new year!
We definitely can do a dedicated (single-tenant) deployment of XetHub. That way your data stays in your environment for its entirety. It also means you can scale up or down the caching/storage nodes to meet the throughput needs for your workloads.
Yes, we built mount with the data center use case in mind. We have seen distributed GPU clusters at 3-5% utilization because they sit idle while downloading data. With mount those GPUs get busy right away, and we have seen 25% reductions in 1st epoch training time.
Small clarification - we store the Merkle tree in the Git repo, in a Git notes database - so that lives with the repo. The only things we store outside the repo are the ~16MB data blocks that represent the files in the repo that are managed by XetHub.
I would also love to hear about the data governance requirements for your company. Those can help us plan what features we need to add to our roadmap. Can you DM me your work email so I can follow up in January?
rajatarya OP t1_j0h7npz wrote
Reply to comment by Liorithiel in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
True :) I haven't used `git annex` myself so for me it felt like _finally_ when I could put all parts of the project in one place with XetHub.
How do you like using git annex? Are you working with others on your projects - does git annex help support team collaboration?
Again, appreciate the comment!
rajatarya OP t1_j0h532a wrote
Reply to comment by tlklk in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Yes, you can keep data entirely remote. We built Xet Mount specifically for this - just mount the repo to get a virtual filesystem view over the repo. We stream the files in the background and on-demand. Or you can clone the repo with `--no-smudge` and just have pointer files. Then you can choose which files to hydrate (smudge) yourself.
For a comparison with DVC, we have a handy feature comparison here: https://xetdata.com/why-xethub. The short answer is that DVC requires you to register which files it should track, and it deduplicates at the file level by simply storing whole files in a remote location. This means that if 1MB of a 500MB file changed daily, with DVC/Git LFS all 500MB would have to be uploaded/downloaded every day. With XetHub only ~1MB would have to be uploaded/downloaded daily.
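To make the arithmetic concrete, here is a minimal sketch of file-level versus chunk-level deduplication. It is not XetHub's implementation: the function names, the fixed-size chunking, and the tiny `CHUNK_SIZE` are assumptions chosen so the example runs instantly. The demo uses 500 chunks of 1KB with one chunk edited in place, which keeps the same 1-in-500 ratio as the 1MB / 500MB case.

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical; far smaller than real data blocks


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-256 of each."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def upload_bytes_file_level(old: bytes, new: bytes) -> int:
    """File-level dedup: if the file hash changed, the whole file is re-sent."""
    if hashlib.sha256(old).digest() == hashlib.sha256(new).digest():
        return 0
    return len(new)


def upload_bytes_chunk_level(old: bytes, new: bytes, size: int = CHUNK_SIZE) -> int:
    """Chunk-level dedup: only chunks whose hash is not already stored are sent."""
    stored = set(chunk_hashes(old, size))
    total = 0
    for i in range(0, len(new), size):
        chunk = new[i:i + size]
        if hashlib.sha256(chunk).hexdigest() not in stored:
            total += len(chunk)
    return total


def make_demo_file(num_chunks: int = 500) -> bytes:
    """Build a deterministic file whose chunks all differ from each other."""
    return b"".join(hashlib.sha256(str(i).encode()).digest() * 32
                    for i in range(num_chunks))


if __name__ == "__main__":
    old = make_demo_file()
    new = bytearray(old)
    new[7 * CHUNK_SIZE:8 * CHUNK_SIZE] = b"\x00" * CHUNK_SIZE  # edit one chunk
    new = bytes(new)
    print(upload_bytes_file_level(old, new))   # 512000
    print(upload_bytes_chunk_level(old, new))  # 1024
```

Fixed-size chunks only handle in-place edits like this one; an insertion would shift every later chunk boundary, which is why real systems typically pick boundaries from the content itself.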
Are you using DVC currently? Would love to hear more about your experience using it and have you try XetHub instead.
rajatarya OP t1_j0h40s0 wrote
Reply to comment by jakethesnake_ in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Tell me more about this. Are you looking to push your data to S3 and then have XetHub ingest it automatically from S3? Or would you like to keep your data in S3 and have XetHub work with it in-place?
We are planning on building the first one (automatic ingestion from S3) - it is on our roadmap for 2023.
Since XetHub builds a Merkle tree over the entire repo we don't actually store the files themselves - instead we store data blocks that are ~16MB chunks of the files. This allows us to efficiently transfer data while still providing fine-grained diffs. That means the files you store in S3 aren't represented in the same way in XetHub - so we cannot manage S3 files in-place. Instead we need to chunk them and build the Merkle tree so we can deduplicate the repo and store it efficiently.
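A rough sketch of the idea, under stated assumptions: files are split into chunks, each unseen chunk is stored once under its hash, and a Merkle root over the chunk hashes identifies the file. The names `store_file` and `merkle_root`, the binary tree shape, the in-memory `dict` block store, and the 8-byte chunk size are all hypothetical and only illustrate the technique; they are not XetHub's actual format.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash pairs of nodes level by level until a single root remains."""
    if not leaves:
        return sha256(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def store_file(data: bytes, block_store: dict[bytes, bytes], chunk_size: int) -> bytes:
    """Chunk a file, store unseen chunks by hash, and return the file's Merkle root."""
    leaves = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = sha256(chunk)
        block_store.setdefault(digest, chunk)  # identical chunks are stored once
        leaves.append(digest)
    return merkle_root(leaves)


if __name__ == "__main__":
    store: dict[bytes, bytes] = {}
    root_a = store_file(b"A" * 8 + b"B" * 8, store, chunk_size=8)
    root_b = store_file(b"A" * 8 + b"C" * 8, store, chunk_size=8)
    print(len(store))        # 3 blocks stored for 4 chunks: the "A" chunk is shared
    print(root_a != root_b)  # True: different content gives a different root
```

In this sketch the root plays the role a pointer file would: it is small, it changes whenever any chunk changes, and the chunks it covers can be fetched from the block store on demand.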
Why would you want to be responsible for your own S3 buckets and files and then have XetHub manage things from there?
rajatarya OP t1_j0h397b wrote
Reply to comment by hughperman in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Thank you for sharing your concerns. We offer on-prem/private-cloud as a deployment option, specifically to help address some of these concerns. Meaning, we can deploy a single-tenant instance of XetHub into your cloud environment (VPC) today. That should help teams located in a single region that isn't near our current deployment.
For teams that are globally distributed we offer cache clusters to allow for scale out and improved throughput, while minimizing costs.
I would love to hear more about your concerns - we are just getting started so lots more to come in the coming months!
rajatarya OP t1_j0gzh2r wrote
Reply to comment by rajatarya in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Oh I forgot to mention - yes! Mapping models to training data is a key part of reproducibility. 100% agree!
Using XetHub you can _finally_ commit the data, features, models, and metadata all in one place (along with the code). You can have full confidence that everything is aligned and working.
rajatarya OP t1_j0gyotd wrote
Reply to comment by Keepclamand- in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Great questions. Definitely check us out - within 15 minutes of getting started you'll experience answers to your questions :)
- Do you support all data types? Yes, all file types are supported. The level of deduplication we can achieve varies by file type (some file types are already compressed) but all file types are supported. We have some great example repos with images, text, and other data types.
- Can you track versioning of data? Yes, since you are just using Git - each commit captures the version of the data (since the data is just files in the repo). This way you have full collaboration features of Git while having full reproducibility. With the added benefit of having confidence that the code will work with the data at each commit.
- Do you have APIs? Not today. Can you tell me what sort of APIs would be interesting to you? We built Xet Mount specifically for use cases when you don't want to download the entire repo - instead you mount it and get a filesystem view over the repo and stream in the files you want to explore/examine/analyze.
Do check out XetHub - I would love to hear your feedback!
rajatarya OP t1_j0gxo3b wrote
Reply to comment by kkngs in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
We are still early in our thinking about the business model - so would love to hear your thoughts on this.
In general, we are thinking about usage-based pricing based on compute, storage, and transfer.
Right now we offer a cloud-based multi-tenant service. We can also deploy into your cloud environment (VPC) as a single-tenant offering.
I would love to hear more about the use case you are thinking about - please DM me to talk more about it (and to hear more details on single-tenant offering).
rajatarya OP t1_j0gvprn wrote
Reply to comment by kkngs in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Great question! There isn't a limit on single file size. Since we chunk each file into blocks the total size of the file isn't a limiting factor. Right now overall repo sizes can be up to 1TB but we have plans to scale that to 100TB in the next year.
Would love it if you could try out XetHub and share your thoughts after using it a bit.
rajatarya OP t1_j0gtsmb wrote
Reply to comment by Retarded_Rhino in [P] XetHub: We scaled Git to support 1 TB repos by rajatarya
Thanks! Try out XetHub and tell me what you think of it - would love to know your thoughts after using it a bit.
Submitted by rajatarya t3_znfgap in MachineLearning
rajatarya OP t1_j43uhw1 wrote
Reply to comment by BossOfTheGame in [R] Git is for Data (CIDR 2023) - Extending Git to Support Large-Scale Data by rajatarya
Does the Community edition of XetHub help address this? See here: https://xetdata.com/pricing/. Everyone today gets 20GB of storage for free.