Because we run individual nodes without swap space, memory demands must be met locally. To enforce this, the Linux tuning parameters overcommit_memory and overcommit_ratio are set so that, on our systems, no more than 90% of physical RAM can be committed. (For parameter details, see the man page.)
malloc fails if your application exceeds this limit, even if the application does not touch all of the memory it has allocated.
The Linux out-of-memory (oom) killer tries to preserve the system by killing applications, but it does not always choose the right one, and it may leave a node unusable. A node that becomes unusable does not affect the operation of the other nodes in the system because the unusable node’s fabric switch and links continue to operate normally.
Typically, when an oom condition occurs, the console log of the affected node contains messages such as:
oom-killer: gfp_mask=0xd0, order=0
When a node becomes unresponsive, check the tail of the Linux console log file for messages. The console log file is located on the SSP in
/var/log/<partition>/<partition>-<module>n<node>.console.
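The check described above can be sketched as a pair of small shell helpers. This is only an illustration: the function names are hypothetical, the 200-line default is an arbitrary choice, and the helpers assume that oom messages contain the string oom-killer, as in the sample message shown earlier.

```shell
# Count the oom-killer messages in a console log.
# $1 = path to the console log file (supplied by the caller).
count_oom_messages() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'oom-killer' "$1"
}

# Show oom-killer messages from the tail of the log, where a recent
# failure would appear.
# $1 = path to the console log file
# $2 = number of trailing lines to scan (default 200, an arbitrary choice)
recent_oom_messages() {
    tail -n "${2:-200}" "$1" | grep 'oom-killer'
}
```

On the SSP, you would pass the console log path for the node in question, substituting real values for <partition>, <module>, and <node>.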
The overcommit_memory and overcommit_ratio parameters specify whether and how physical memory can be overcommitted.
• overcommit_memory = <0|1|2>
0 - Heuristic overcommit handling (the Linux default). Any obvious overcommitment is refused; root is allowed to allocate slightly more memory than other users.
1 - Always allow applications to overcommit physical memory. Useful for some scientific applications that allocate large amounts of memory but do not touch all of the allocated pages.
2 - Never allow overcommitment of memory. Refuse any request that would raise the total committed memory above swap space plus overcommit_ratio percent of physical RAM (with no swap, overcommit_ratio percent of physical RAM). In these cases, malloc fails.
• overcommit_ratio = <##>
The percentage of physical memory that applications, taken together, are allowed to commit when overcommit_memory = 2.
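To make the arithmetic behind overcommit_memory = 2 concrete, the following sketch computes the commit limit as swap plus overcommit_ratio percent of RAM and then reads the kernel's own figures from /proc/meminfo. The function names are hypothetical, and the calculation ignores huge pages, which the kernel subtracts from RAM before applying the ratio.

```shell
# Estimate the commit limit (in kB) that applies when overcommit_memory = 2.
# $1 = MemTotal in kB, $2 = overcommit_ratio in percent,
# $3 = SwapTotal in kB (default 0, matching nodes that run without swap).
# Simplification: huge pages are ignored.
commit_limit_kb() {
    mem_kb=$1
    ratio=$2
    swap_kb=${3:-0}
    echo $(( swap_kb + mem_kb * ratio / 100 ))
}

# Print the limit and the currently committed amount as the kernel reports them.
# $1 = meminfo file to read (default /proc/meminfo).
show_commit_state() {
    grep -E '^(CommitLimit|Committed_AS):' "${1:-/proc/meminfo}"
}
```

For example, commit_limit_kb 8000000 90 0 prints 7200000: on a node with 8000000 kB of RAM, no swap, and a ratio of 90, allocations start to fail once about 7200000 kB has been committed. On a running node, Committed_AS approaching CommitLimit in the show_commit_state output indicates that malloc failures are close.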
You can change the overcommit parameters in two ways:
Edit the vm.overcommit_memory and vm.overcommit_ratio settings in the /opt/sicortex/rootfs/default/etc/sysctl.conf file, then reboot the system.
As root, reset the overcommit parameters at runtime, for example:
srun -p <partition> -N <all> bash -c "echo <0|1|2> > /proc/sys/vm/overcommit_memory"
srun -p <partition> -N <all> bash -c "echo <value> > /proc/sys/vm/overcommit_ratio"
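Values written to /proc/sys at runtime do not survive a reboot, so it can be worth comparing what is configured in sysctl.conf with what is currently in effect. The helpers below are a minimal sketch of that comparison; the function names are hypothetical, and the parsing assumes simple "key = value" lines.

```shell
# Print the value configured for a key in a sysctl.conf-style file.
# $1 = file, $2 = key (for example vm.overcommit_ratio).
# Comment lines do not match; if the key appears more than once,
# the last occurrence is printed.
configured_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p" "$1" | tail -n 1
}

# Print the value currently in effect on this node.
# $1 = parameter name without the vm. prefix (for example overcommit_memory)
# $2 = directory to read from (default /proc/sys/vm)
running_value() {
    cat "${2:-/proc/sys/vm}/$1"
}
```

For example, configured_value /opt/sicortex/rootfs/default/etc/sysctl.conf vm.overcommit_memory shows the setting that applies after the next reboot, while running_value overcommit_memory shows the setting on the node where it runs. To read the running value back from the compute nodes, the same srun form used above should work with cat /proc/sys/vm/overcommit_memory in place of the echo command.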
If you allow memory overcommitment and see processes killed because of oom errors, unmount (umount) and then mount the Lustre file system again to reclaim the file system space used by those processes.