Difference between revisions of "DoLoops performance in Fortran"
From MohidWiki
(→Results) |
(→Results) |
||
| Line 104: | Line 104: | ||
===Results=== | ===Results=== | ||
| − | ---------------------------- | + | ---------------------------- |
| − | DO (i,j,k) / NO CHUNK | + | DO (i,j,k) / NO CHUNK |
| − | ---------------------------- | + | ---------------------------- |
| − | + | ||
| − | Table A.1 - Debug do(i,j,k) | + | Table A.1 - Debug do(i,j,k) |
| − | Size Time | + | Size Time |
| − | 100 0.04 | + | 100 0.04 |
| − | 200 0.37 | + | 200 0.37 |
| − | 300 1.58 | + | 300 1.58 |
| − | 400 7.60 | + | 400 7.60 |
| − | 500 19.66 | + | 500 19.66 |
| − | 600 41.65 | + | 600 41.65 |
| − | + | ||
| − | Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k) | + | Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k) |
| − | Size Time | + | Size Time |
| − | 100 0.04 | + | 100 0.04 |
| − | 200 0.37 | + | 200 0.37 |
| − | 300 1.58 | + | 300 1.58 |
| − | 400 7.27 | + | 400 7.27 |
| − | 500 19.34 | + | 500 19.34 |
| − | 600 41.34 | + | 600 41.34 |
| − | + | ||
| − | Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k) | + | Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k) |
| − | Size Time | + | Size Time |
| − | 100 0.02 | + | 100 0.02 |
| − | 200 0.19 | + | 200 0.19 |
| − | 300 0.70 | + | 300 0.70 |
| − | 400 1.86 | + | 400 1.86 |
| − | 500 4.05 | + | 500 4.05 |
| − | 600 7.83 | + | 600 7.83 |
| − | + | ||
| − | ---------------------------- | + | ---------------------------- |
| − | DO (k,j,i) / NO CHUNK | + | DO (k,j,i) / NO CHUNK |
| − | ---------------------------- | + | ---------------------------- |
| − | + | ||
| − | Table B.1 - Debug do(k,j,i) | + | Table B.1 - Debug do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.04 | + | 100 0.04 |
| − | 200 0.31 | + | 200 0.31 |
| − | 300 1.22 | + | 300 1.22 |
| − | 400 3.36 | + | 400 3.36 |
| − | 500 7.55 | + | 500 7.55 |
| − | 600 14.88 | + | 600 14.88 |
| − | + | ||
| − | Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i) | + | Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.04 | + | 100 0.04 |
| − | 200 0.31 | + | 200 0.31 |
| − | 300 1.21 | + | 300 1.21 |
| − | 400 3.36 | + | 400 3.36 |
| − | 500 7.82 | + | 500 7.82 |
| − | 600 15.07 | + | 600 15.07 |
| − | + | ||
| − | Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) | + | Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.02 | + | 100 0.02 |
| − | 200 0.09 | + | 200 0.09 |
| − | 300 0.36 | + | 300 0.36 |
| − | 400 0.94 | + | 400 0.94 |
| − | 500 2.04 | + | 500 2.04 |
| − | 600 3.89 | + | 600 3.89 |
| − | + | ||
| − | ---------------------------- | + | ---------------------------- |
| − | DO (k,j,i) / STATIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1 | + | DO (k,j,i) / STATIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1 |
| − | ---------------------------- | + | ---------------------------- |
| − | + | ||
| − | Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) | + | Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.02 | + | 100 0.02 |
| − | 200 0.15 | + | 200 0.15 |
| − | 300 0.42 | + | 300 0.42 |
| − | 400 1.03 | + | 400 1.03 |
| − | 500 2.12 | + | 500 2.12 |
| − | 600 3.97 | + | 600 3.97 |
| − | + | ||
| − | ---------------------------- | + | ---------------------------- |
| − | DO (k,j,i) / STATIC CHUNK = 10 | + | DO (k,j,i) / STATIC CHUNK = 10 |
| − | ---------------------------- | + | ---------------------------- |
| − | + | ||
| − | Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) | + | Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.02 | + | 100 0.02 |
| − | 200 0.16 | + | 200 0.16 |
| − | 300 0.43 | + | 300 0.43 |
| − | 400 1.04 | + | 400 1.04 |
| − | 500 2.18 | + | 500 2.18 |
| − | 600 4.05 | + | 600 4.05 |
| − | + | ||
| − | ---------------------------- | + | ---------------------------- |
| − | DO (k,j,i) / DYNAMIC CHUNK = 10 | + | DO (k,j,i) / DYNAMIC CHUNK = 10 |
| − | ---------------------------- | + | ---------------------------- |
| − | + | ||
| − | Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) | + | Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.01 | + | 100 0.01 |
| − | 200 0.10 | + | 200 0.10 |
| − | 300 0.36 | + | 300 0.36 |
| − | 400 0.93 | + | 400 0.93 |
| − | 500 2.01 | + | 500 2.01 |
| − | 600 3.89 | + | 600 3.89 |
| − | + | ||
| − | ---------------------------- | + | ---------------------------- |
| − | DO (k,j,i) / DYNAMIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1 | + | DO (k,j,i) / DYNAMIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1 |
| − | ---------------------------- | + | ---------------------------- |
| − | + | ||
| − | Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) | + | Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) |
| − | Size Time | + | Size Time |
| − | 100 0.02 | + | 100 0.02 |
| − | 200 0.09 | + | 200 0.09 |
| − | 300 0.39 | + | 300 0.39 |
| − | 400 1.04 | + | 400 1.04 |
| − | 500 2.14 | + | 500 2.14 |
| − | 600 4.00 | + | 600 4.00 |
===Conclusions=== | ===Conclusions=== | ||
Revision as of 16:29, 21 October 2010
What is the best performance that Fortran can give when computing do-loops over large matrices? The test-case below shows a four-times performance increase when looping over (k,j,i) insteado of (i,j,k).
Test-case
Hardware
- Intel Core i7 - 870
- 8 GB Ram
Code
- The main program
program DoloopsOpenmp
use moduleDoloopsOpenmp, only: makeloop
implicit none
integer, dimension(:,:,:), pointer :: mycube
integer :: M = 1
real :: elapsedtime
real :: time = 0.0
do while (M < 1000)
write(*,*) 'Insert the cube size M (or insert 1000 to exit): '
read(*,*) M
if (M > 999) exit
allocate(mycube(1:M,1:M,1:M))
!Tic()
time = elapsedtime(time)
call makeloop(mycube)
!Toc()
time = elapsedtime(time)
write(*,10) time
write(*,*)
deallocate(mycube)
nullify(mycube)
end do
10 format('Time elapsed: ',F6.2)
end program DoloopsOpenmp
!This function computes the time
real function elapsedtime(lasttime)
integer :: count, count_rate
real :: lasttime
call system_clock(count, count_rate)
elapsedtime = count * 1.0 / count_rate - lasttime
end function elapsedtime
- The module
module moduleDoloopsOpenmp
use omp_lib
implicit none
private
public :: makeloop
contains
subroutine makeloop(cubicmatrix)
!Arguments --------------
integer, dimension(:,:,:), pointer :: cubicmatrix
!Local variables --------
integer :: i, j, k, lb, ub
lb = lbound(cubicmatrix,3)
ub = ubound(cubicmatrix,3)
!$OMP PARALLEL PRIVATE(i,j,k)
!$OMP DO
do k = lb, ub
do j = lb, ub
do i = lb, ub
cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
end do
end do
end do
!$OMP END DO
!$OMP END PARALLEL
end subroutine makeloop
end module moduleDoloopsOpenmp
Results
---------------------------- DO (i,j,k) / NO CHUNK ---------------------------- Table A.1 - Debug do(i,j,k) Size Time 100 0.04 200 0.37 300 1.58 400 7.60 500 19.66 600 41.65 Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k) Size Time 100 0.04 200 0.37 300 1.58 400 7.27 500 19.34 600 41.34 Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k) Size Time 100 0.02 200 0.19 300 0.70 400 1.86 500 4.05 600 7.83 ---------------------------- DO (k,j,i) / NO CHUNK ---------------------------- Table B.1 - Debug do(k,j,i) Size Time 100 0.04 200 0.31 300 1.22 400 3.36 500 7.55 600 14.88 Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i) Size Time 100 0.04 200 0.31 300 1.21 400 3.36 500 7.82 600 15.07 Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) Size Time 100 0.02 200 0.09 300 0.36 400 0.94 500 2.04 600 3.89 ---------------------------- DO (k,j,i) / STATIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1 ---------------------------- Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) Size Time 100 0.02 200 0.15 300 0.42 400 1.03 500 2.12 600 3.97 ---------------------------- DO (k,j,i) / STATIC CHUNK = 10 ---------------------------- Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) Size Time 100 0.02 200 0.16 300 0.43 400 1.04 500 2.18 600 4.05 ---------------------------- DO (k,j,i) / DYNAMIC CHUNK = 10 ---------------------------- Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) Size Time 100 0.01 200 0.10 300 0.36 400 0.93 500 2.01 600 3.89 ---------------------------- DO (k,j,i) / DYNAMIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1 ---------------------------- Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i) Size Time 100 0.02 200 0.09 300 0.39 400 1.04 500 2.14 600 4.00
Conclusions
- do(k,j,i) Vs do(i,j,k) ==> 2 to 4 times faster!
- dynamic small chunks, or no chunk at all yield 10% increased performance over large dynamic chunks. Probably better off with no-chunk.
- More test-cases representing different scenarios of do-loops may yield different choices of CHUNK/scheduling.