Staredit Network > Forums > Technology & Computers > Topic: Compute build

Mar 3 2020, 8:53 pm Vrael Post #1



So my wife runs some scientific computing code on a cluster at her university, and basically it's miserable to use: there's a 5-day job limit (I think it uses SLURM for scheduling), and it's generally awkward and hard to work with. Besides helping her optimize the code, I had the idea that we could drop some $$ on a small dedicated computer ourselves (or better yet, get her department to buy one) which we could then use. So I've taken a stab at putting a build together, and was interested in what you guys have to say about it.

Here are the requirements:

- No gfx card required; we'll basically just SSH into this machine (probably still install full Ubuntu on it for ease of use and occasionally hook up a monitor, but this can be a huge $$ saver. Actually we could use a GFX card for some OpenGL rendering we do on the output, but let's pretend that's a problem for a different machine)
- Maximum number of CPU cores. The code is parallelized and runs faster on more cores. On my i7-6700K (4 physical, 8 logical) it isn't feasible to run the datasets overnight, so in the build below I picked the 16-core AMD. The point of this machine is to just churn on the data for a few days or weeks and not have to worry about arbitrary 5-day compute limits
- expandable RAM. 32GB is a good start (I mem-mapped a lot of the data, so memory use does NOT scale with the number of CPUs, but the buffer we write results to is still on the order of 14GB and possibly larger). If possible, I'd like a board that can take up to 128GB
- at least 1 SSD. Right now our largest input data file is 3.1 GB, but I'm not sure what the future holds
- will be running some flavor of Linux (probably Ubuntu, since that's what I'm most familiar with). The code uses a fork() setup for multiple processes within the Python ecosystem; there's no fork() on Windows, and Apple's fork() restrictions make it unreliable on macOS
- don't need peripherals (mouse/keyboard/monitor)
- price: no idea. I'm thinking $1500 is a reasonable number to shoot for, especially when asking the department to pay for it. There could be a lot of headroom above that number, or it could be way too much; I don't know for sure, so let's say $1500 is the target.


Here is the starting build I came up with:

PCPartPicker Part List

CPU: AMD Ryzen 9 3950X 3.5 GHz 16-Core Processor ($747.95 @ Amazon)
CPU Cooler: Cooler Master Hyper 212 Black Edition 42 CFM CPU Cooler ($39.67 @ Amazon)
Motherboard: Asus TUF GAMING X570-PLUS (WI-FI) ATX AM4 Motherboard ($183.99 @ B&H)
Memory: G.Skill Ripjaws V 32 GB (1 x 32 GB) DDR4-3200 Memory ($149.99 @ Newegg)
Storage: Samsung 970 Evo 500 GB M.2-2280 NVME Solid State Drive ($87.99 @ Amazon)
Case: Corsair 200R ATX Mid Tower Case ($70.04 @ Amazon)
Power Supply: Corsair RM (2019) 650 W 80+ Gold Certified Fully Modular ATX Power Supply ($109.99 @ Corsair)
Total: $1389.62
Prices include shipping, taxes, and discounts when available
Generated by PCPartPicker 2020-03-03 15:52 EST-0500




Mar 4 2020, 5:16 pm NudeRaider Post #2

We can't explain the universe, just describe it; and we don't know whether our theories are true, we just know they're not wrong. >Harald Lesch

Props for choosing the timeless classic that the CM Hyper 212 is. :D

And sorry for having to ask, but the potential gain is far too great to ignore: you have a task that is highly parallel, so is there no way to rewrite the code to use OpenCL or CUDA? It might be worth considerable effort, as you can expect to speed up the computation by at least an order of magnitude, if not two, AND get cheaper hardware.

If the answer is no, you picked the best CPU for the job. There's even faster CPUs but they'll overshoot your budget by A LOT.

I noticed you went with a single stick of RAM. This might be fine, or a big mistake. Your platform has a dual-channel memory interface, which can potentially double your RAM throughput. A single DDR4-3200 module has a transfer speed of ~25GB/s. So now you have to know how much that limits your program's throughput.
Maybe you can do a test run with 1 vs. 2 modules and see how much faster it is. Or maybe you can run jobs that need less RAM, so you don't hit a capacity limit with 1 module?
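A rough way to measure that difference, sketched here with NumPy (the buffer size and rep count are arbitrary choices, just large enough to defeat the CPU caches): copy a buffer much bigger than L3 and time it, once per RAM configuration, then compare the GB/s.

```python
# Crude single-process memory-bandwidth probe. Copying a buffer far
# larger than the CPU caches forces the traffic out to main RAM, so
# the measured rate approximates single-threaded memory bandwidth.
import time
import numpy as np

N = 256 * 1024 * 1024            # 256 MiB source, well beyond L3
src = np.ones(N, dtype=np.uint8)
dst = np.empty_like(src)

reps = 8
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)
elapsed = time.perf_counter() - t0

# each rep reads N bytes and writes N bytes
gb_per_s = 2 * N * reps / elapsed / 1e9
print(f"approx copy bandwidth: {gb_per_s:.1f} GB/s")
```

Note this only probes one core; with 16 cores all touching memory at once, the aggregate demand is what would expose a single-module bottleneck.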

Without actually checking, I suspect there's a cheaper motherboard that would do the job. Just make sure to get good quality, for stable voltages and all the connectors you need; everything else shouldn't matter.

Gold certification is a reasonable choice for the PSU, considering you'll leave it running at high load for so long. You could also probably get away with a much lower wattage if you want to save some bucks; a PSU calculator recommends 426W, even including a GTX 1660 GPU. But keep in mind that PSUs are often most efficient around 60% load, so the higher wattage you chose may save you some electricity cost.

Post has been edited 3 time(s), last time on Mar 4 2020, 5:28 pm by NudeRaider.




Mar 4 2020, 6:36 pm Vrael Post #3



I don't personally know how to rewrite it in OpenCL or CUDA, which effectively means there is no way to rewrite it in those languages. Unfortunately, this algorithm is extremely intricate. Personally I would rewrite it with certain geometric simplifications, but due to the legacy of previous research using the same algorithm, we want to keep it in its weird intricate state (I've been telling my wife we should just do a stability analysis comparing the results of the current algorithm against a simplified version). So even if I knew OpenCL or CUDA, it wouldn't just be a couple of nice simple kernels to speed up the code.

The single stick of RAM was just to leave room for the up-to-128GB that the mobo offers. If I bought 2x16 and later we decided we needed the full 128GB, we'd basically have to buy 4x32 and toss the 2x16s.




Mar 5 2020, 4:30 pm NudeRaider Post #4


Quote from Vrael
Unfortunately, this algorithm is extremely intricate, and while personally I would re-write it with certain geometric simplifications, due to the legacy of previous research using the same algorithm we want to keep it in its weird intricate state (I've been telling my wife we should just do a stability analysis between the results of the current algorithm and a simplified version), so even if I knew OpenCL or CUDA it wouldn't just be a couple of nice simple kernels to help speedup the code.
I must admit I don't really understand what this means for the programming experience, so I'll just reply to this:
Quote from Vrael
I don't personally know how to rewrite it in OpenCL or CUDA, which effectively means there is no way to rewrite it in those languages.

I'm guessing you haven't looked into the matter, because this suggests it would not be too hard, given that you know any of the mentioned languages:
Quote from wikipedia
The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages and platforms such as Python,[14] Java, Perl[15] and .NET.[11]:15 An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute device(s) targeted.



Quote from Vrael
The single stick of RAM was just to leave room for the up to 128GB that the MOBO offers.
I realized that. I was pointing out that you are possibly limiting yourself NOW for a potential need later, and that you should check IF it actually limits you now. When the need arises later, you can still upgrade. A middle-of-the-road approach is also possible: 2x16 GB now, 2x32 GB later.
tl;dr Don't underestimate the amount of data that needs to be fed to 32 threads.




Mar 5 2020, 8:37 pm Vrael Post #5



Lol, let me put the OpenCL/CUDA issue in another light: $1300 is an order of magnitude (possibly two or more) cheaper than the cost of my time to learn a new programming specification and then convert the existing code (technically that's a false statement since I'm doing this work for my wife for free, but assume I was being paid). I recently wrote a visualization program (for the output of this code, actually) in OpenGL, whose GLSL (the OpenGL Shading Language) is also a C-like language, and it took months of effort. Would it be the better long-term solution for my growth as a programmer to learn CUDA and convert this code? Probably. Would this code run faster on a GPU? I'd guess yes, though I'm not 100% sure. But do I currently have too many things on my plate anyway and just want a lazy solution that doesn't require learning a whole new computing model? Yes. Am I resistant to the idea of converting it because I've already sunk so much time into optimizing the current architecture? Yes. :)

Also, I still don't understand your point about the RAM. If it makes any difference, I believe the current code is CPU-bottlenecked, not RAM-bottlenecked; I use a lot of contiguous arrays, so I imagine the prefetches are pretty easy for the CPU. Of course, I could try and check.... :D
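One crude way to check without perf counters, sketched below with a stand-in compute kernel (the real code would be substituted for `task`): run a fixed amount of work across an increasing number of processes and see whether throughput keeps scaling. Near-linear scaling suggests a CPU-bound workload; an early plateau hints that memory bandwidth is the limit.

```python
# Scaling probe: if throughput stops growing as processes are added
# (well below the physical core count), shared memory bandwidth is a
# likely suspect. `task` is a placeholder compute-heavy kernel.
import time
import multiprocessing as mp

def task(_):
    # stand-in for the real per-chunk computation
    s = 0
    for i in range(200_000):
        s += i * i
    return s

if __name__ == "__main__":
    for nproc in (1, 2, 4):
        t0 = time.perf_counter()
        with mp.Pool(nproc) as pool:
            pool.map(task, range(nproc * 4))
        dt = time.perf_counter() - t0
        print(f"{nproc} procs: {nproc * 4 / dt:.1f} tasks/s")
```

(This placeholder kernel is cache-friendly, so it should scale almost linearly; the interesting result is what the real memmap-heavy workload does.)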




Mar 5 2020, 8:57 pm NudeRaider Post #6


kk. I'm convinced.

Quote from Vrael
Also I still don't understand your point about the RAM. If it makes any difference, I believe the current code is CPU bottlenecked, not RAM bottlenecked, I use a lot of contiguous arrays so I imagine the pre-fetch ops are pretty easy for the CPU. Course I could try and check.... :D
Well, I have no idea what your code is doing, so I can't judge whether a RAM bottleneck is realistic or not. If you say it likely isn't, then you won't hear another word from me about it. But your earlier argument about RAM was about max capacity, not bandwidth.

I only pointed it out because we're not used to even considering RAM as having a (significant) performance impact, but now that simultaneous thread counts are skyrocketing, it's becoming an issue again. Not for every workload, of course, but still something to consider.




Mar 6 2020, 4:13 am Vrael Post #7



Yeah, it's still a good point to consider (regarding the RAM). The closest empirical evidence I have is that my i7-6700K runs individual threads of this code significantly faster than the compute cluster at the university, which I believe uses an older 2.9GHz Xeon. The system monitor also shows CPU usage capped at 100% across all cores, though I'm not sure how accurate that is for truly gauging whether the CPU is stalling while waiting on memory. Modern CPUs have huge L1/L2/L3 caches, though, so I'm skeptical that faster RAM would do much.

The RAM max-size issue is because I use a memory-mapped array as shared memory for the output of this code, which on the most recent dataset is 14GB, so the RAM just needs to be big enough (though technically, with the right kernel settings, the whole thing never has to be fully in RAM at any one time, so we could exceed physical RAM).
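A scaled-down sketch of that setup with NumPy (the file path and shape here are made up; the real buffer is ~14 GB): processes forked after the mapping is created all see the same array, and the OS can evict dirty pages to the backing file, which is why the array can exceed physical RAM.

```python
# Scaled-down memory-mapped shared output buffer (8 MiB here vs. the
# real ~14 GB). Pages live in the page cache and spill to the backing
# file under memory pressure, so capacity isn't hard-limited by RAM.
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "results.dat")

# create a file-backed float64 array; mode="w+" creates/overwrites
out = np.memmap(path, dtype=np.float64, mode="w+", shape=(1024, 1024))
out[0, :4] = [1.0, 2.0, 3.0, 4.0]  # e.g. one worker writing its slice
out.flush()                        # push dirty pages to the file

# reopen read-only to verify the data landed on disk
check = np.memmap(path, dtype=np.float64, mode="r", shape=(1024, 1024))
print(check[0, :4])
```

With the fork() setup, each worker can write a disjoint slice of the same mapping with no explicit locking, which matches the "shared output buffer" description above.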




Mar 6 2020, 12:59 pm NudeRaider Post #8


Quote
though I'm not sure how accurate that is in terms of truly gauging whether the CPU is stalling at all due to waiting on memory
Beats me as well. Benchmarks are usually the way to go to test something like that.




Mar 13 2020, 7:48 am ShadowFlare Post #9



I might not be back in a timely manner to reply to anything further, but I just wanted to ask: I noticed it's not in your list, so do you at least have a spare graphics card to put into this? With that CPU, it might simply not boot without one. The video ports on the motherboard are only there because some CPUs have integrated graphics; the motherboard itself doesn't have any.




Mar 13 2020, 4:18 pm Vrael Post #10



You know, that's a good catch, ShadowFlare. I'll have to look into that further.




Mar 13 2020, 6:19 pm NudeRaider Post #11


She's right. There are no Ryzen 9s with integrated graphics; it would need a G after the number (e.g. Ryzen 5 2400G). I'm surprised PCPartPicker didn't flag that as a compatibility issue.

You can just plug in basically any PCIe graphics card. Either a cheap new one, or maybe you or a friend has an old one lying around?




Mar 16 2020, 8:41 pm NudeRaider Post #12


How about 400 Threads? :w00t:




Mar 16 2020, 9:35 pm Vrael Post #13



The code doesn't benefit that much from additional logical cores, but 96 physical cores would certainly be better than 16! Now I just need the price point on this bad boy, lol. I didn't see it in the article; any idea what it might be?



