163 comments
  • I manage a machine that runs both media transcodes and some video game servers.

    The video game servers have to run in real-time, or very close to it. Otherwise players using them suffer noticeable lag.

    Achieving this while an ffmpeg process was running was completely impossible, no matter what I did to limit ffmpeg's use of CPU time. Even when running at the lowest priority, it impacted the game server processes running at top priority. Even when I limited it to one thread, it was affecting things.

    I couldn't understand the problem. There was enough CPU time to go around to do both things, and the transcode wasn't even time sensitive, while the game server was, so why couldn't the Linux kernel just figure it out and schedule things in a way that made sense?

    So, for the first time I read up on how computers actually handle processes, multi-tasking and CPU scheduling.

    As FFMPEG is an application that uses ALL available CPU time until a task is done, I came to the conclusion that context switching was to blame. CPU cores can only do one thing at a time; they just switch out what they do really fast, and that switching takes time too. With zero processing headroom, this was causing the system to fall behind on the video game processes. The scheduler wasn't smart enough to maintain a real-time process in the face of FFMPEG, which would occupy ALL available cycles.

    I learned the solution was core pinning: manually setting processes to run on certain cores of the CPU. I set FFMPEG to use only one core, since it doesn't matter how fast it completes. And I set the game processes to use all but that one core, so they don't accidentally end up queueing for CPU time on a core that doesn't have the headroom to run the task within a reasonable time.

    This has completely solved the problem, as the game processes and FFMPEG no longer wait for CPU cycles in the same queue.
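
    A minimal sketch of this kind of pinning, assuming `taskset` from util-linux is available. The helper name `cores_except`, core 0 as the FFMPEG core, the file names, and `$GAME_PID` are illustrative assumptions, not the exact commands used on this machine.

    ```shell
    # cores_except SKIP TOTAL: print the CPU list 0..TOTAL-1 without core SKIP.
    # (hypothetical helper, only for illustration)
    cores_except() {
        skip=$1
        total=$2
        list=""
        i=0
        while [ "$i" -lt "$total" ]; do
            if [ "$i" -ne "$skip" ]; then
                list="${list:+$list,}$i"
            fi
            i=$((i + 1))
        done
        printf '%s\n' "$list"
    }

    # Example usage (commented out; file names and GAME_PID are placeholders):
    #   taskset -c 0 nice -n 19 ffmpeg -i input.mkv output.mkv
    #   taskset -a -cp "$(cores_except 0 "$(nproc)")" "$GAME_PID"
    ```

    The first `taskset` line starts the transcode restricted to one core; the second changes the affinity of an already running game server (all of its threads, via `-a`) to every other core.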

  • cool, now find another distro

    • Sometimes .... usually I just hit a wall because I don't know enough but I know enough to get myself in trouble .... so I just stop, reformat, reinstall and start all over.

      About the biggest lesson I've learned from Linux is not to mess with too many things unless you want to learn about it and have lots of time on your hands.

      Otherwise, if you find a good distro for your needs, stick with it, don't change it, and update and back up regularly.

  • Getting a Palm Pilot a live connection to the internet over infrared (Red Hat Linux). That was circa 2004, and I spent 10 hours, all night, on it.

  • A couple months ago, I made a Palworld server box out of a spare motherboard assembly (mobo, processor, ram) from a computer I had recently upgraded.

    I didn't have any spare drives lying around, so I plugged in 7 USB flash drives and made them into a RAID array. Not a true RAID array, but a BTRFS filesystem with volumes spread onto each flash drive, with the data redundancy set to raid1, and the metadata redundancy set to raid1c3.
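
    A rough sketch of how such a filesystem could be created, assuming a btrfs-progs new enough to know the raid1c3 profile. The helper names `min_devices` and `mkfs_command` and the device names are made up for illustration, and the helper only prints the `mkfs.btrfs` line instead of running it.

    ```shell
    # min_devices PROFILE: smallest device count Btrfs accepts for a profile.
    min_devices() {
        case "$1" in
            single|dup) echo 1 ;;
            raid1)      echo 2 ;;
            raid1c3)    echo 3 ;;
            raid1c4)    echo 4 ;;
            *)          return 1 ;;   # profiles not covered by this sketch
        esac
    }

    # mkfs_command DATA_PROFILE META_PROFILE DEVICE...: print (not run) the mkfs line.
    mkfs_command() {
        data=$1
        meta=$2
        shift 2
        need_data=$(min_devices "$data") || return 1
        need_meta=$(min_devices "$meta") || return 1
        if [ "$#" -lt "$need_data" ] || [ "$#" -lt "$need_meta" ]; then
            echo "need more devices for data=$data metadata=$meta" >&2
            return 1
        fi
        echo "mkfs.btrfs -d $data -m $meta $*"
    }

    # Example (device names are placeholders):
    #   mkfs_command raid1 raid1c3 /dev/sda1 /dev/sdb1 /dev/sdc1
    ```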

    It worked... in the sense that I never lost any data. It certainly didn't work in the sense of having good uptime.

    The first problem was getting it to boot right. The boot line in GRUB had "root=UUID=..." instead of a specific drive named. That is normal. However, in BTRFS multi-volume filesystems, all the volumes have the same UUID. So the initrd was only waiting for a single drive matching that UUID, then trying to mount it as the root filesystem. This failed, because the kernel had not yet set up the other 6 USB drives, and this BTRFS filesystem needs all 7 volumes present. Maybe 6, if you used the "degraded" mount option.

    The workaround was to wait for this boot process to fail, at which point you get dropped into an initrd shell. Then, you look at all the drives and make sure they're all there. And then... I don't exactly remember what happened next. I think it was some black magic that erases your mind in the process. I somehow got it booted from the initrd shell.

    Installing Steam and the Palworld server worked ok, and it even ran for a few hours before crashing overnight.

    The next morning, I tried rebooting it. Unfortunately, the USB drives weren't all appearing. Turns out the motherboard had some bad USB ports, some sometimes-bad USB ports, and a maybe-bad PCIe bus, because the PCIe USB expansion card I plugged in had weird problems that it had never had before.

    I found the most reliable ports and plugged the drives in there. But you can't just replug them in the initrd. It doesn't have USB hotplug support. So each time it tried to boot with not all the drives there, I restarted it again until one time I finally had all the drives.

    I changed the GRUB boot line to "root=/dev/sdg1". This made it wait for all the drives to load, in any order, and whichever one was last would be mounted as the root filesystem (but the kernel would automatically include all the others too, since they were successfully initialized).

    The bad USB ports kept bringing down the server every day or two. I bought a cheap NVMe drive and added it to the BTRFS filesystem, and then removed all the USB drives except the largest. That fixed the reliability. It's been like that since.

    Now, to boot the server, all I have to do is change the GRUB boot line to "root=/dev/sdb1". Since the NVMe drive is much faster than the USB drive, it always initializes first. If the initrd waits for sdb1, then it will always have both drives initialized when it tries to mount the root filesystem.
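
    The "wait until every drive is there, then mount" idea can also be written as a small polling loop for a rescue shell. This is only a sketch: `wait_for_paths` is a hypothetical helper, the device names and mount point are placeholders, and it assumes the initrd ships `sleep` and the `btrfs` tool.

    ```shell
    # wait_for_paths TIMEOUT PATH...: poll once a second until every PATH exists.
    # Returns 0 when all are present, 1 if TIMEOUT seconds pass first.
    wait_for_paths() {
        remaining=$1
        shift
        while :; do
            missing=0
            for p in "$@"; do
                [ -e "$p" ] || missing=1
            done
            if [ "$missing" -eq 0 ]; then
                return 0
            fi
            if [ "$remaining" -le 0 ]; then
                return 1
            fi
            sleep 1
            remaining=$((remaining - 1))
        done
    }

    # Example (commented out; device names and mount point are placeholders):
    #   wait_for_paths 30 /dev/sda1 /dev/sdb1 && btrfs device scan && mount /dev/sdb1 /root
    ```

    `btrfs device scan` registers every member device with the kernel, so mounting any one of them afterwards brings in the rest.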

    I could add that to the grub.cfg, or come up with some other more permanent solution, but I'm not planning on rebooting this server ever again. My friends fell off Palworld, and I gave a shutdown date that's about a week away. And the electricity is pretty reliable here.

  • More than a decade ago a user came into #ubuntu-server on Freenode (now libera.chat) and said that they had accidentally run "rm -rf /* something*" in a root shell.

    Note the errant space that made that a fatal mistake. I don't remember how far it actually got in deleting files, but all of /bin/ /sbin/ and /usr/ were gone.

    He had 1 active ssh connection, and couldn't start another one.

    It was a server that was "in production", was thousands of miles away from him, and which had no possibility for IPMI / remote hands.

    Everyone (but me) in the channel said that he was just SoL and should just give up.

    I stayed up most of the night helping him. I like challenges and I like helping people.

    This was in the sysv-init (maybe upstart) days, and so a decent number of shell scripts were running, and using basic *nix commands.

    We recovered the bash binary by running something along the lines of

     
        
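    # $(<file) is a bash builtin read, so it works without cat.
    # Caveat: shell variables cannot hold NUL bytes, so a binary will not survive intact.
    bash_binary_contents="$( </proc/self/exe)"
    printf "%s" "$bash_binary_contents" > /tmp/bash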
    
      

    (If you can access "lsof" then "sudo lsof | grep deleted" will show you any files that are open, but also "deleted". You may be surprised at how many there are!)

    But bash needed too many shared libraries to make that practical.

    Somehow we were able to recover curl and chmod, after which I had him download busybox-static. From there we downloaded an Ubuntu LiveCD iso, loop mounted it, loop mounted the squashfs image inside the iso, and copied all of /bin/, /sbin/, /etc, and so on from there onto his root FS.

    Then we re-installed missing packages and fixed up /etc/ (a lot of important daemons, including the one that was production critical, kept their configuration files open, and so we were able to use lsof to find the magic symlinks to them in /proc/$pid/fd/ and just cp them back into /etc/).
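
    The /proc/$pid/fd/ trick can be sketched without lsof. The function names `list_deleted_fds` and `recover_fd` are made up for illustration, and the sketch assumes `readlink` and `cat` are available (for example from a static busybox); it is not a transcript of what we actually typed that night.

    ```shell
    # list_deleted_fds PID: print "FD TARGET" for each open file of PID that has
    # been unlinked (the kernel appends " (deleted)" to the link target).
    list_deleted_fds() {
        for fd in /proc/"$1"/fd/*; do
            target=$(readlink "$fd" 2>/dev/null) || continue
            case "$target" in
                *" (deleted)") printf '%s %s\n' "${fd##*/}" "$target" ;;
            esac
        done
    }

    # recover_fd PID FD DEST: copy the still-open contents back to a real file.
    recover_fd() {
        cat "/proc/$1/fd/$2" > "$3"
    }

    # Example (commented out; the PID, fd number and destination are placeholders):
    #   list_deleted_fds 1234
    #   recover_fd 1234 5 /etc/example.conf
    ```

    This works because the data stays on disk for as long as some process still holds the file open, even after its name is gone.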

    We were able to restart openssh-server, log in again, and I don't remember if we were brave enough to test rebooting.

    But we fucking did it!

    I am certainly getting a lot of details wrong from memory. It's all somewhere at irclogs.ubuntu.com though. My nick was / is Jordan_U.

    I tried to find it once, and failed.

  • I mostly fried an SSD by using it to write and rewrite ML checkpoints and logs, which in turn made the device read-only. I somehow managed to migrate to a different SSD, probably using Clonezilla or something, but that messed up the bootloader, so I installed rEFInd in a new partition, configured it, and voila, it works. It's scary because you need to do everything without seeing your system even half alive anywhere along the process, but it's not actually hard: just copying data and installing/configuring a bootloader. But for a then 20-year-old at his more or less first job, my head was on fire for the 1.5 days this took.

    By far the most difficult single thing that I've ever had to fix that actually had to do with the system.

    I now don't flood my SSDs with data that is constantly rewritten.

  • This will feel extremely simple for some folks, but I was having a hell of a time getting Steam games that had previously worked through Proton running. I scoured the internet for solutions after trying to install proton-ge and testing multiple versions. Eventually someone had the galaxy brain idea to suggest installing WINE. For some reason, that fixed the problem real good.

  • Learned how drivers worked and fixed a driver for a USB to I2C chip. It's still buggy but at least it sorta works now.

    Some more details: I was using a CH347 (USB to UART/SPI/I2C) and there was an open source driver written for a previous chip version. The original dev had hardcoded the bulk I/O endpoint indices. The only change I had to make was to iterate over the endpoints and search for the correct ones. But at first, I didn't understand anything about how the USB subsystem worked and how drivers were loaded. All I could tell was that the USB device was correctly detected but the I2C driver wasn't being loaded, despite proper udev rules, correct vendor/product IDs, etc.

  • Can't think of the most difficult problem, but I have managed to solve a lot of problems with btrfs snapshots.

  • I've generally had good luck with hardware and things just worked under linux. But one day I upgraded a few machines on my network to 2.5G ethernet. Several already had the ports, but my little NUC NAS box didn't, so I installed a 2.5G usb ethernet dongle. No matter what I did, I couldn't get it to work. It would show up and NM would act like it was up and there were no errors or anything, but it just wouldn't actually function.

    Eventually, I found out that it has a built-in USB data partition that contains the drivers for Windows. When the hardware was attached, the dongle was coming up as a USB disk first, not as the network card it should have been.

    I had to blacklist the USB modules first, which I had done before, but I also had to write a udev rule to automatically add the network card and driver on boot. It wasn't that difficult to actually do, but I had just never had to do anything with udev rules before. Took me a good three days of troubleshooting to finally get everything to work correctly on boot.

    ACTION=="add", ATTRS{idVendor}=="20f4", ATTRS{idProduct}=="e02c", RUN+="/sbin/modprobe r8152", RUN+="/bin/sh -c 'echo 20f4 e02c > /sys/bus/usb/drivers/r8152/new_id'"

  • While playing around with face/fingerprint unlock for my laptop, I messed up PAM (Linux Pluggable Authentication Modules) and no passwords were working anymore except for the root account. At first I was still logged in to my account, but then I stupidly rebooted and could only log in as root. After so many config edits, I gave up and instead booted up Windows (my laptop is dual-booted), set up a new Linux install in VirtualBox, and copied the PAM config files from the VM over to the actual Linux install.

    and it all somehow worked!

    I am now facing another issue, which I'm gonna mention here in the hope somebody has already run into it: after updating to KDE Plasma 6, tap to click works on my touchpad, but actually, physically pressing on the trackpad doesn't work. I can hear the pad's physical clicking noise, but nothing happens OS-wise.

    this one's still to be resolved

  • I can't remember the details anymore, but for a year or two I had a bad run of absolutely hosing my boot config and leaving myself in a state where the system either couldn't find its kernel or couldn't find the root partition and would drop me into an initramfs emergency shell. I got pretty good at booting into a live environment, getting all my dm-raid and lvm disks discovered, mounting all the relevant file systems in the right place, chrooting in and rebuilding the pieces that were broken.
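
    A sketch of that live-environment routine, with the caveat that everything specific in it is an assumption: the `run` and `enter_chroot` helpers, the `DRY_RUN` switch, the `/mnt` mount point and the device names are placeholders, and RAID assembly is left out because it depends on the setup.

    ```shell
    # run CMD...: print the command when DRY_RUN=1, otherwise execute it.
    run() {
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "+ $*"
        else
            "$@"
        fi
    }

    # enter_chroot ROOT_DEV [BOOT_DEV]: mount an installed system under /mnt
    # and chroot into it. Device names are supplied by the caller.
    enter_chroot() {
        root_dev=$1
        boot_dev=${2:-}
        run vgchange -ay                    # activate LVM volume groups
        run mount "$root_dev" /mnt
        if [ -n "$boot_dev" ]; then
            run mount "$boot_dev" /mnt/boot
        fi
        run mount -t proc proc /mnt/proc    # fresh procfs for the chroot
        run mount --rbind /sys /mnt/sys     # reuse the live system's sysfs
        run mount --rbind /dev /mnt/dev     # and its device nodes
        run chroot /mnt /bin/sh
    }

    # Example (commented out; device names are placeholders):
    #   DRY_RUN=1; enter_chroot /dev/vg0/root /dev/sda1
    # Inside the chroot, the rebuild step is distro-specific, for example
    # update-initramfs -u or mkinitcpio -P, followed by grub-install if needed.
    ```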

  • Biggest thing I noticed after switching is forum posts. In Linux ones you usually get a fix, whereas in Windows ones, 9 times out of 10 they just advise you to reformat.

  • At some point I installed the Rust implementation of the coreutils from the AUR. They worked for a long while, until some SSL vulnerability was discovered and everyone had to update the library. As you can imagine, without working coreutils the system was hard to use. Troubleshooting was also a pain in the ass, because who would blame coreutils of all things? :P
