Shared memory parallelization under Ubuntu

Dear all,

I have just started using Palabos under Ubuntu Linux. I am trying to verify the parallelization by running the cavity3d benchmark example on my local PC, which has an Intel® Core™ i7-2600K CPU @ 3.40GHz × 8 cores.
I have set the proper flags in the make file, namely:

[code]
MPIparallel = false
# Set SMP-parallel mode on/off (shared-memory parallelism)
SMPparallel = true
[/code]

However, when I try to run the example using the following commands, I get:

[code]
mpirun -np 1 ./cavity3d 100
Starting benchmark with 101x101x101 grid points (approx. 2 minutes on modern processors).
Number of MPI threads: 1
After 291 iterations: 7.24897 Mega site updates per second
[/code]

and

[code]
mpirun -np 2 ./cavity3d 100
Starting benchmark with 101x101x101 grid points (approx. 2 minutes on modern processors).
Starting benchmark with 101x101x101 grid points (approx. 2 minutes on modern processors).
Number of MPI threads: 1
Number of MPI threads: 1
After 291 iterations: 6.85924 Mega site updates per second.
After 291 iterations: 6.84047 Mega site updates per second.
[/code]

which does not make sense to me. It seems that each core is executing the entire program, and the execution times of the two cases are similar. Could anyone tell me what might be the issue here? Any help or suggestions are greatly appreciated.

Thanks,
Hanieh


You switch off MPI parallelization, but then use mpirun to run your case. Since the executable contains no MPI calls, mpirun simply runs the same program twice… I do not know, though, how to properly use shared-memory parallelization in Palabos; it would be interesting to learn more about this.
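
To illustrate (with a hypothetical non-MPI program called hello):

[code]
# mpirun on an executable built without MPI support simply launches
# N independent copies of it:
mpirun -np 2 ./hello
# -> each copy runs the whole program and prints its own output;
#    there is no work sharing and no speedup.
[/code]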

Dear Philippe,

Thanks for your response. The MPIparallel flag is meant for cluster-like (distributed-memory) parallelization. I have, however, tried setting it to ‘true’, but in that case I do not get any output, i.e., the simulation gets stuck on my local machine.

I also note that with the initial setup, using mpirun -np 2, I can get two cores involved. From the documentation, I see that using pcout should remove the duplicate printout, which is not the case here (not to mention that there is no speedup gain).

Hanieh

Dear all,

Palabos uses MPI for both shared-memory and distributed-memory parallelism. Therefore, to compile for parallel execution you should use:

[code]
MPIparallel = true
SMPparallel = false
[/code]

and execute the program, for example as:

[code]
mpirun -np 8 ./cavity3d 100
[/code]
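
Note that after changing these flags you need to recompile the example, e.g. (assuming the example's standard Makefile):

[code]
make clean
make
[/code]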

Best,
d

You should be careful about the difference between cores and threads. The processor you specified has 4 physical cores but 8 hardware threads in total. For MPI and similar parallelization methods, only the 4 cores give you a speedup. You will likely see a slowdown when you use more than that, because you are splitting the job further while still using the same cores, which only adds unnecessary communication.
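
A quick way to see the difference on Linux (assuming the standard /proc/cpuinfo layout):

[code]
# logical processors (includes hyperthreads):
grep -c '^processor' /proc/cpuinfo
# physical cores per CPU package:
grep 'cpu cores' /proc/cpuinfo | head -1
[/code]

On an i7-2600K the first command should report 8 and the second 4.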

Dear Dimitris,

thank you for the clarification. Does that mean that shared-memory parallelization in Palabos is not truly “shared memory” in the sense of, for example, OpenMP or pthreads?

best
Philippe

Dear Philippe,

There is one “parallelization model” in Palabos, and it is based on the MPI interface. What your local MPI implementation uses internally to exploit shared memory is another story…

Best,
d

OK, but what is the compile flag “SMPparallel” then good for? Does setting it to “true” actually change anything if there is no underlying implementation that works without MPI?

The “SMPparallel” variable is there as a placeholder for our internal development. For all uses, the safe choice is to set it to “false”.

Best,
d

Thanks again for all the feedback.

Now I have set:

[code]
MPIparallel = true
SMPparallel = false
[/code]

and tried to compare the results to those from: http://wiki.palabos.org/plb_wiki:benchmark:cavity_n100

Command: mpirun -np x ./cavity3d 100, with x = 1, 2, 4.

[code]
1 core : After 291 iterations: 7.28 Msu
2 cores: After 582 iterations: 5.37 Msu
4 cores: takes too long to finish
[/code]

I have OpenMPI 1.4.3-2 installed. As I increase the number of cores, the execution time increases substantially, which results in a smaller Msu. Any idea why I am not getting any parallel speedup?

Hello HanieM,

that’s strange. Can you try to do a make clean and rebuild cavity3d?

If this doesn’t help, can you post the output of

[code]
cat /proc/cpuinfo
[/code]

What is the version of Palabos you are using?

Thanks

Hello Yann,

I did a make clean and rebuilt cavity3d, and I am still getting results similar to those in my previous post.

I am using palabos-v1.4r1, and here is the output of ‘cat /proc/cpuinfo’:

[code]
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel® Core™ i7-2600K CPU @ 3.40GHz
stepping : 7
microcode : 0x25
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 6822.01
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
[/code]

Processors 1 through 7 report the same information; only the processor, core id, apicid, and initial apicid fields differ (core ids 0-3 each appear twice, with siblings : 8 and cpu cores : 4 throughout), and the momentary cpu MHz readings vary between 1600.000 and 3401.000.



Thanks,
Hanieh

Hello, I have no idea what is going on then.

Are you running Linux in a virtual machine? Your cpuinfo indicates that you have 8 logical processors but only 4 physical cores, which means hyperthreading is enabled. You can try to disable it (in the BIOS) and relaunch the benchmarks.

I assume you are not running out of memory and hitting swap?
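
A quick way to check that, assuming the usual free utility is available:

[code]
free -m   # shows RAM and swap usage in megabytes
[/code]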

Can you post the output of

[code]
ldd cavity3d
[/code]

thanks

I disabled hyperthreading and I got the following results:

[code]
Number of MPI threads: 1
After 291 iterations: 7.30535 Mega site updates per second.

Number of MPI threads: 2
After 582 iterations: 7.34646 Mega site updates per second.
[/code]

So the performance got better with 2 cores, but not enough. With 4 cores it still takes a very long time.

Here is the output of 'ldd cavity3d':

[code]
linux-gate.so.1 =>  (0xb76ee000)
libmpi_cxx.so.0 => /usr/lib/libmpi_cxx.so.0 (0xb76c1000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0xb7618000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7532000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7506000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb74e8000)
libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb74cd000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb7324000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0xb72d6000)
/lib/ld-linux.so.2 (0xb76ef000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0xb7282000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb727d000)
libutil.so.1 => /lib/i386-linux-gnu/libutil.so.1 (0xb7279000)
[/code]

I am not using any virtual machine or swap.

Thanks,
Hanieh

Can you check whether all the cores are being used when you run cavity3d (using htop, for example)? Are they at 100%?

Can you post the output of

[code]
cat /proc/cpuinfo | grep MHz
[/code]

while you are running cavity3d with 4 cores (around 30 seconds after the launch)?

  • With 1 core, I see that the core is being used at 100%.

  • With 2 cores, they are only at 100% for a fraction of the total time; for the rest, usage is highly volatile, ranging from 5% to 95%. The behavior also seems random: I do not always get the same or similar Msu values, and sometimes I get only around 4 Msu with 2 cores.

  • With 4 cores, for the most part only 3 out of 4 cores are above 50% at the same time, with only 2 (and rarely 3) at 100% for a fraction of the time. In general, usage is highly volatile and never stays at 100% for any sustained period.

For the 4-core case, the output of the ‘cat’ command after around 30 seconds is:

[code]
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
[/code]

However, subsequent runs of the command do not always give the same output; sometimes the reported MHz is as low as half the above values.
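
The varying readings are most likely just CPU frequency scaling on idle cores. One way to inspect the governor (assuming the standard sysfs cpufreq interface) is:

[code]
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# "ondemand" or "powersave" lets idle cores drop to 1600 MHz;
# "performance" keeps them at the full clock.
[/code]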

I noticed that I had a 32-bit version of Ubuntu installed, whereas my processor can handle 64-bit (a quick way to check this is sketched after the list below). Here is what I did:

  • reinstalled a 64-bit Ubuntu 12.04
  • updated OpenMPI to 1.5.4
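
As a quick sanity check of the architecture (assuming the standard uname and file tools):

[code]
uname -m          # x86_64 for a 64-bit kernel, i686 for a 32-bit one
file ./cavity3d   # reports "ELF 32-bit" or "ELF 64-bit" for the binary
[/code]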

Now I am getting the expected parallelization (hyperthreading is enabled):

[code]
1 core : 7.74 Msu
2 cores: 13.47 Msu
4 cores: 19.96 Msu
8 cores: 16.82 Msu --> perhaps due to hyperthreading
[/code]
 
I get similar values with hyperthreading disabled and using up to 4 cores.

Thanks a lot for taking the time to look into this. Really glad it is working now.

Best,
Hanieh

Hello, that’s great news.

Yes, it is expected that you won’t see any benefit from using more than 4 cores. In fact, on almost all HPC clusters hyperthreading is disabled because of performance issues.
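
If you want to experiment further, OpenMPI can also bind each rank to a physical core; the exact option depends on the OpenMPI version (shown here only as an illustration):

[code]
# OpenMPI 1.4/1.5 style:
mpirun -np 4 --bind-to-core ./cavity3d 100
# newer OpenMPI releases use:
#   mpirun -np 4 --bind-to core ./cavity3d 100
[/code]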

Enjoy Palabos