DoLoops performance in Fortran
What is the best performance that Fortran can give when computing do-loops over large matrices? A single triple-do-loop test-case was implemented with matrix sizes ranging from 1M to 216M four-byte elements. The test-cases below show:

* a performance increase of up to 400% when looping in (k,j,i) order instead of (i,j,k) order,
* a performance increase of roughly 300% when using OpenMP directives on a quad-core processor (i7-870),
* that changing the chunk size or alternating between static and dynamic scheduling yields less than 10% difference in performance. The best performance is obtained, for this test-case, with a small dynamic chunk (the scheduling variants are sketched below).
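For reference, the static/dynamic/chunk variants compared in the results below are controlled by the SCHEDULE clause of the OpenMP DO directive. The test harness shown later uses the default (no SCHEDULE clause); a sketch of how an explicit variant would look inside makeloop, using the same loop and names as the code below, the clause itself being the only change:

 !$OMP PARALLEL PRIVATE(i,j,k)
 !$OMP DO SCHEDULE(DYNAMIC, 10)   !small dynamic chunk; use SCHEDULE(STATIC, chunk) for static
 do k = lb, ub
     do j = lb, ub
         do i = lb, ub
             cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
         end do
     end do
 end do
 !$OMP END DO
 !$OMP END PARALLEL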
==Test-cases==
===Simple triple do-loop===
====Description====
Performs a triple-do-loop with a simple computation over a cubic matrix. Size is chosen by the user through standard input.
====Hardware====
*Intel Core i7-870
*8 GB RAM
====Code====
* The main program
 program DoloopsOpenmp

     use moduleDoloopsOpenmp, only: makeloop

     implicit none

     integer, dimension(:,:,:), pointer :: mycube
     integer :: M = 1
     real :: elapsedtime    !external timing function, defined below
     real :: time = 0.0

     do while (M < 1000)

         write(*,*) 'Insert the cube size M (or insert 1000 to exit): '
         read(*,*) M
         if (M > 999) exit

         allocate(mycube(1:M,1:M,1:M))

         !Tic(): reset the stopwatch, then store the current clock reading
         !(without the reset, the previous iteration's elapsed time would
         !offset the next measurement)
         time = 0.0
         time = elapsedtime(time)

         call makeloop(mycube)

         !Toc(): returns the time elapsed since Tic()
         time = elapsedtime(time)
         write(*,10) time
         write(*,*)

         deallocate(mycube)
         nullify(mycube)

     end do

 10  format('Time elapsed: ',F6.2)

 end program DoloopsOpenmp
 !This function computes elapsed wall-clock time: it returns the
 !current system clock reading (in seconds) minus lasttime.
 real function elapsedtime(lasttime)

     implicit none

     real :: lasttime
     integer :: count, count_rate

     call system_clock(count, count_rate)
     elapsedtime = count * 1.0 / count_rate - lasttime

 end function elapsedtime
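Since the module already uses omp_lib, an alternative stopwatch for these runs is omp_get_wtime, which returns wall-clock seconds as a double precision value and avoids the manual clock arithmetic. A minimal sketch (the subroutine name timedloop is illustrative, not part of the original harness):

 !Sketch: timing makeloop with OpenMP's wall-clock timer
 subroutine timedloop(mycube)
     use omp_lib, only: omp_get_wtime
     use moduleDoloopsOpenmp, only: makeloop
     implicit none
     integer, dimension(:,:,:), pointer :: mycube
     double precision :: tstart, tend
     tstart = omp_get_wtime()
     call makeloop(mycube)
     tend = omp_get_wtime()
     write(*,'(A,F6.2)') 'Time elapsed: ', tend - tstart
 end subroutine timedloop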
* The module
 module moduleDoloopsOpenmp

     use omp_lib

     implicit none

     private
     public :: makeloop

     contains

     subroutine makeloop(cubicmatrix)

         !Arguments --------------
         integer, dimension(:,:,:), pointer :: cubicmatrix

         !Local variables --------
         integer :: i, j, k, lb, ub

         !The matrix is cubic, so the bounds of dimension 3 apply to all three
         lb = lbound(cubicmatrix,3)
         ub = ubound(cubicmatrix,3)

         !$OMP PARALLEL PRIVATE(i,j,k)
         !$OMP DO
         do k = lb, ub
             do j = lb, ub
                 do i = lb, ub
                     cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
                 end do
             end do
         end do
         !$OMP END DO
         !$OMP END PARALLEL

     end subroutine makeloop

 end module moduleDoloopsOpenmp
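For comparison, the do(i,j,k) variant measured in Tables A.1 to A.3 only reverses the nesting order, so the innermost loop runs over the third (non-contiguous in memory) index; everything else in makeloop is unchanged:

 !Sketch: loop order used for the do(i,j,k) measurements
 do i = lb, ub
     do j = lb, ub
         do k = lb, ub
             cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
         end do
     end do
 end do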
====Results====
<webimage>http://content.screencast.com/users/GRiflet/folders/Jing/media/49c2beae-32f4-4da0-96b1-ddae90bd2d02/2010-10-21_1735.png</webimage>
*Full results
*Looking only at the results with STATIC/DYNAMIC/CHUNK variations:
 ----------------------------
 DO (i,j,k) / NO CHUNK
 ----------------------------
 Table A.1 - Debug do(i,j,k)
 Size    Time
 100     0.04
 200     0.37
 300     1.58
 400     7.60
 500     19.66
 600     41.65
 
 Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k)
 Size    Time
 100     0.04
 200     0.37
 300     1.58
 400     7.27
 500     19.34
 600     41.34
 
 Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k)
 Size    Time
 100     0.02
 200     0.19
 300     0.70
 400     1.86
 500     4.05
 600     7.83
 
 ----------------------------
 DO (k,j,i) / NO CHUNK
 ----------------------------
 Table B.1 - Debug do(k,j,i)
 Size    Time
 100     0.04
 200     0.31
 300     1.22
 400     3.36
 500     7.55
 600     14.88
 
 Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i)
 Size    Time
 100     0.04
 200     0.31
 300     1.21
 400     3.36
 500     7.82
 600     15.07
 
 Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size    Time
 100     0.02
 200     0.09
 300     0.36
 400     0.94
 500     2.04
 600     3.89
 
 ----------------------------
 DO (k,j,i) / STATIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
 ----------------------------
 Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size    Time
 100     0.02
 200     0.15
 300     0.42
 400     1.03
 500     2.12
 600     3.97
 
 ----------------------------
 DO (k,j,i) / STATIC CHUNK = 10
 ----------------------------
 Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size    Time
 100     0.02
 200     0.16
 300     0.43
 400     1.04
 500     2.18
 600     4.05
 
 ----------------------------
 DO (k,j,i) / DYNAMIC CHUNK = 10
 ----------------------------
 Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size    Time
 100     0.01
 200     0.10
 300     0.36
 400     0.93
 500     2.01
 600     3.89
 
 ----------------------------
 DO (k,j,i) / DYNAMIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
 ----------------------------
 Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size    Time
 100     0.02
 200     0.09
 300     0.39
 400     1.04
 500     2.14
 600     4.00
====Conclusions====
* do(k,j,i) vs do(i,j,k): 2 to 4 times faster. This is expected: Fortran stores arrays in column-major order, so running the first index in the innermost loop accesses memory contiguously.
* Small dynamic chunks, or no chunk at all, yield about 10% better performance than large dynamic chunks. The default no-chunk setting is probably the safest choice.
* More test-cases representing different do-loop scenarios may lead to different choices of CHUNK/scheduling (a runtime-selectable schedule, sketched below, makes such experiments easier).
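To experiment with CHUNK/scheduling without recompiling, OpenMP offers SCHEDULE(RUNTIME), which defers the choice to the OMP_SCHEDULE environment variable. A sketch against the makeloop code above:

 !$OMP DO SCHEDULE(RUNTIME)   !schedule and chunk read from OMP_SCHEDULE at run time

and then, before launching the test, for example on Windows (export OMP_SCHEDULE=... on Unix shells):

 set OMP_SCHEDULE=dynamic,10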
===SetMatrix3D_Constant===
====Description====
This routine is in the ModuleFunctions module of MohidWater. In the context of MohidWater, parallelizing it actually degrades performance; however, in a small standalone test program the same OMP directives yield quite good results. A plausible explanation is that the loop does very little work per element, so the threading overhead dominates when the routine is called on the smaller grids typical of a full MohidWater run (see the guard sketched after the code below).
====Hardware====
*Core i7-870
*8 GB RAM
====Code====
 real function SetMatrixValues3D_R8_Constant (Matrix, Valueb, MapMatrix)

     !Arguments-------------------------------------------------------------
     real, dimension(:, :, :), pointer              :: Matrix
     real, intent(IN)                               :: Valueb
     integer, dimension(:, :, :), pointer, optional :: MapMatrix

     !Local-----------------------------------------------------------------
     integer :: i, j, k
     integer :: ilb, iub, jlb, jub, klb, kub

     !Begin-----------------------------------------------------------------

     ilb = lbound(Matrix,1)
     iub = ubound(Matrix,1)
     jlb = lbound(Matrix,2)
     jub = ubound(Matrix,2)
     klb = lbound(Matrix,3)
     kub = ubound(Matrix,3)

     !griflet: omp slowdown (inside MohidWater; see Description above)
     if (present(MapMatrix)) then
         !$OMP PARALLEL DO PRIVATE(i,j,k)
         do k = klb, kub
             do j = jlb, jub
                 do i = ilb, iub
                     if (MapMatrix(i, j, k) == 1) then
                         Matrix(i, j, k) = Valueb
                     endif
                 enddo
             enddo
         enddo
         !$OMP END PARALLEL DO
     else
         !$OMP PARALLEL DO PRIVATE(i,j,k)
         do k = klb, kub
             do j = jlb, jub
                 do i = ilb, iub
                     Matrix(i, j, k) = Valueb
                 enddo
             enddo
         enddo
         !$OMP END PARALLEL DO
     endif

     !sumMatrix3D is an auxiliary function defined elsewhere in ModuleFunctions
     SetMatrixValues3D_R8_Constant = sumMatrix3D(Matrix)

 end function SetMatrixValues3D_R8_Constant
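Given the slowdown observed inside MohidWater, one standard way to keep the directive while avoiding the threading overhead on small grids is OpenMP's IF clause, which falls back to serial execution below a threshold. A sketch against the else branch above (the threshold of 100000 cells is an illustrative guess that would need tuning, not a measured value):

 !Sketch: parallelize only when the matrix is large enough to amortize the overhead
 !$OMP PARALLEL DO PRIVATE(i,j,k) &
 !$OMP IF((kub - klb + 1) * (jub - jlb + 1) * (iub - ilb + 1) > 100000)
 do k = klb, kub
     do j = jlb, jub
         do i = ilb, iub
             Matrix(i, j, k) = Valueb
         enddo
     enddo
 enddo
 !$OMP END PARALLEL DO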
====Results====
<webimage>http://content.screencast.com/users/GRiflet/folders/Jing/media/136f62db-a24f-4315-9eae-f4bcd9110d4e/2010-10-22_1553.png</webimage>

[[Category:programming]]
[[Category:fortran]]