Difference between revisions of "DoLoops performance in Fortran"
From MohidWiki
Revision as of 16:29, 21 October 2010
What is the best performance that Fortran can give when computing do-loops over large matrices? The test-case below shows that looping over (k,j,i) instead of (i,j,k) is roughly two to three times faster for the larger cubes.
Test-case
Hardware
- Intel Core i7 - 870
- 8 GB Ram
Code
- The main program
 program DoloopsOpenmp

     use moduleDoloopsOpenmp, only: makeloop

     implicit none

     integer, dimension(:,:,:), pointer :: mycube
     integer :: M = 1
     real :: elapsedtime
     real :: time = 0.0

     do while (M < 1000)

         write(*,*) 'Insert the cube size M (or insert 1000 to exit): '
         read(*,*) M
         if (M > 999) exit

         allocate(mycube(1:M,1:M,1:M))
         mycube = 0      !Initialize so the loop does not read undefined values

         !Tic(): reset the reference so earlier runs do not accumulate
         time = 0.0
         time = elapsedtime(time)

         call makeloop(mycube)

         !Toc()
         time = elapsedtime(time)

         write(*,10) time
         write(*,*)

         deallocate(mycube)
         nullify(mycube)

     end do

     10 format('Time elapsed: ',F6.2)

 end program DoloopsOpenmp

 !This function returns the clock reading in seconds minus lasttime
 real function elapsedtime(lasttime)
     implicit none
     integer :: count, count_rate
     real :: lasttime
     call system_clock(count, count_rate)
     elapsedtime = count * 1.0 / count_rate - lasttime
 end function elapsedtime
- The module
 module moduleDoloopsOpenmp

     use omp_lib

     implicit none

     private

     public :: makeloop

 contains

     subroutine makeloop(cubicmatrix)

         !Arguments --------------
         integer, dimension(:,:,:), pointer :: cubicmatrix

         !Local variables --------
         integer :: i, j, k, lb, ub

         lb = lbound(cubicmatrix,3)
         ub = ubound(cubicmatrix,3)

         !$OMP PARALLEL PRIVATE(i,j,k)
         !$OMP DO
         do k = lb, ub
         do j = lb, ub
         do i = lb, ub
             cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
         end do
         end do
         end do
         !$OMP END DO
         !$OMP END PARALLEL

     end subroutine makeloop

 end module moduleDoloopsOpenmp
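The module above shows only the do(k,j,i) nesting. As a minimal serial sketch of what the two orderings compared in the results might look like side by side, the program below times both on the same cube. The program name, the cube size `m = 200` and the 64-bit timer variables are assumptions made for illustration, not taken from the original test. An optimizing compiler may interchange the loops on its own, so the difference is most likely to show in a debug (unoptimized) build like the ones reported here.

```fortran
program loop_order_sketch
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none

    integer, parameter :: m = 200            ! hypothetical cube size
    integer, allocatable :: cube(:,:,:)
    integer(int64) :: t0, t1, rate
    integer :: i, j, k

    allocate(cube(m, m, m))
    cube = 0

    ! do(i,j,k): the innermost loop runs over the LAST index,
    ! so consecutive iterations touch elements m*m apart in memory.
    call system_clock(t0, rate)
    do i = 1, m
        do j = 1, m
            do k = 1, m
                cube(i,j,k) = cube(i,j,k) + 1
            end do
        end do
    end do
    call system_clock(t1)
    write(*,'(A,F8.3)') 'do(i,j,k) seconds: ', real(t1 - t0) / real(rate)

    ! do(k,j,i): the innermost loop runs over the FIRST index,
    ! so consecutive iterations touch adjacent elements in memory.
    call system_clock(t0, rate)
    do k = 1, m
        do j = 1, m
            do i = 1, m
                cube(i,j,k) = cube(i,j,k) + 1
            end do
        end do
    end do
    call system_clock(t1)
    write(*,'(A,F8.3)') 'do(k,j,i) seconds: ', real(t1 - t0) / real(rate)

    ! Sanity check: every element was incremented exactly twice.
    write(*,*) 'all elements equal 2: ', all(cube == 2)

    deallocate(cube)
end program loop_order_sketch
```

Both loop nests do the same arithmetic; only the order in which memory is visited differs.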
Results
 ----------------------------
 DO (i,j,k) / NO CHUNK
 ----------------------------

 Table A.1 - Debug do(i,j,k)
 Size   Time (s)
 100     0.04
 200     0.37
 300     1.58
 400     7.60
 500    19.66
 600    41.65

 Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k)
 Size   Time (s)
 100     0.04
 200     0.37
 300     1.58
 400     7.27
 500    19.34
 600    41.34

 Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k)
 Size   Time (s)
 100     0.02
 200     0.19
 300     0.70
 400     1.86
 500     4.05
 600     7.83

 ----------------------------
 DO (k,j,i) / NO CHUNK
 ----------------------------

 Table B.1 - Debug do(k,j,i)
 Size   Time (s)
 100     0.04
 200     0.31
 300     1.22
 400     3.36
 500     7.55
 600    14.88

 Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i)
 Size   Time (s)
 100     0.04
 200     0.31
 300     1.21
 400     3.36
 500     7.82
 600    15.07

 Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size   Time (s)
 100     0.02
 200     0.09
 300     0.36
 400     0.94
 500     2.04
 600     3.89

 ----------------------------
 DO (k,j,i) / STATIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
 ----------------------------

 Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size   Time (s)
 100     0.02
 200     0.15
 300     0.42
 400     1.03
 500     2.12
 600     3.97

 ----------------------------
 DO (k,j,i) / STATIC CHUNK = 10
 ----------------------------

 Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size   Time (s)
 100     0.02
 200     0.16
 300     0.43
 400     1.04
 500     2.18
 600     4.05

 ----------------------------
 DO (k,j,i) / DYNAMIC CHUNK = 10
 ----------------------------

 Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size   Time (s)
 100     0.01
 200     0.10
 300     0.36
 400     0.93
 500     2.01
 600     3.89

 ----------------------------
 DO (k,j,i) / DYNAMIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
 ----------------------------

 Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
 Size   Time (s)
 100     0.02
 200     0.09
 300     0.39
 400     1.04
 500     2.14
 600     4.00
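The page does not show the directives used for the chunked runs (tables C to F). One plausible way to write them, offered only as a sketch, is a SCHEDULE clause on the parallel loop. The module name `schedule_sketch`, the routine `makeloop_chunked` and its `large_chunk` switch are hypothetical; `omp_get_max_threads` is the standard `omp_lib` routine, used here as a stand-in for NTHREADS.

```fortran
module schedule_sketch
    use omp_lib
    implicit none
    private
    public :: makeloop_chunked

contains

    ! Hypothetical variant of makeloop with an explicit SCHEDULE clause.
    subroutine makeloop_chunked(cubicmatrix, large_chunk)
        integer, dimension(:,:,:), pointer :: cubicmatrix
        logical, intent(in) :: large_chunk
        integer :: i, j, k, lb, ub, chunk

        lb = lbound(cubicmatrix, 3)
        ub = ubound(cubicmatrix, 3)

        if (large_chunk) then
            ! CHUNK = (UBOUND - LBOUND) / NTHREADS + 1, as in the table headings
            chunk = (ub - lb) / omp_get_max_threads() + 1
        else
            ! CHUNK = 10, as in the table headings
            chunk = 10
        end if

        ! Replace DYNAMIC with STATIC to obtain the static-chunk variants.
        !$OMP PARALLEL DO SCHEDULE(DYNAMIC, chunk) PRIVATE(i, j, k)
        do k = lb, ub
            do j = lb, ub
                do i = lb, ub
                    cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
                end do
            end do
        end do
        !$OMP END PARALLEL DO
    end subroutine makeloop_chunked

end module schedule_sketch

program schedule_demo
    use schedule_sketch, only: makeloop_chunked
    implicit none
    integer, dimension(:,:,:), pointer :: cube

    allocate(cube(1:100, 1:100, 1:100))
    cube = 0

    call makeloop_chunked(cube, .true.)    ! one large chunk per thread
    call makeloop_chunked(cube, .false.)   ! chunks of 10 iterations of k

    ! Sanity check: every element was incremented exactly twice.
    write(*,*) 'all elements equal 2: ', all(cube == 2)

    deallocate(cube)
end program schedule_demo
```

OpenMP must be enabled at compile time for the directive to take effect, for example with `-fopenmp` in gfortran. With no SCHEDULE clause at all, as in tables A.3 and B.3, the choice of schedule is left to the implementation.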
Conclusions
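- do(k,j,i) vs do(i,j,k): about 2 to 3 times faster for cube sizes of 400 and above, both in the serial debug runs and with one !$OMP PARALLEL DO directive. Fortran stores arrays in column-major order, so the innermost loop should run over the first index.
- Small dynamic chunks (CHUNK = 10), or no chunk at all, are up to about 10% faster than large dynamic chunks or static chunks. Probably better off with no chunk.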
- More test-cases representing different scenarios of do-loops may yield different choices of CHUNK/scheduling.
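The first conclusion follows from how Fortran lays out arrays in memory. As a short worked illustration, assuming an M x M x M cube with lower bounds of 1 as allocated in the main program, the position of element (i,j,k) counted in elements from the start of the array is:

```latex
\[
\operatorname{offset}(i,j,k) = (i-1) + M\,(j-1) + M^{2}\,(k-1)
\]
\[
\text{stride in } i = 1, \qquad
\text{stride in } j = M, \qquad
\text{stride in } k = M^{2}
\]
```

With i innermost, successive iterations touch adjacent elements. With k innermost, each iteration jumps M^2 elements: for M = 600 that is 360000 elements, or about 1.44 MB if the default integer is 4 bytes (an assumption about the compiler settings), so nearly every access falls on a different cache line.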