DoLoops performance in Fortran

From MohidWiki

Revision as of 16:29, 21 October 2010

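What is the best performance that Fortran can give when computing do-loops over large matrices? The test case below shows that looping over (k,j,i), with the first index i innermost, instead of (i,j,k) runs roughly two to three times faster at the larger cube sizes (14.88 s instead of 41.65 s at size 600 in the serial debug build).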
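The loop-order effect comes from Fortran storing arrays in column-major order: the first index varies fastest in memory. The following is a minimal sketch, not part of the original test case, that makes the storage order visible on a small 2x3 array; the program name and the values are purely illustrative.

```fortran
program column_major_demo
   implicit none

   integer :: a(2,3)
   integer :: i, j

   !Encode the indices in the value: a(i,j) = 10*i + j
   do j = 1, 3
   do i = 1, 2
       a(i,j) = 10*i + j
   end do
   end do

   !reshape walks the array in storage (array element) order,
   !so the first index i changes fastest: prints 11 21 12 22 13 23
   print '(6(i3))', reshape(a, [6])

end program column_major_demo
```

Because neighbouring values of i are adjacent in memory, a loop nest with i innermost touches memory contiguously, while one with k innermost jumps a whole plane of the cube at every step.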

Test-case

Hardware

  • Intel Core i7-870
  • 8 GB RAM

Code

  • The main program
program DoloopsOpenmp

   use moduleDoloopsOpenmp, only: makeloop

   implicit none
   
   integer, dimension(:,:,:), pointer      :: mycube
   integer                                 :: M = 1
   real                                    :: elapsedtime
   real                                    :: time = 0.0

   do while (M < 1000)

       write(*,*) 'Insert the cube size M (or insert 1000 to exit): '
       read(*,*) M
       
       if (M > 999) exit

       allocate(mycube(1:M,1:M,1:M))
       mycube = 0   !Initialise so that makeloop does not read undefined values

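       !Tic(): reset the reference first, otherwise the previous run's
       !elapsed time is carried over and the timings accumulate
       time = 0.0
       time = elapsedtime(time)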

       call makeloop(mycube)

       !Toc()
       time = elapsedtime(time)
       
       write(*,10) time
       write(*,*) 
        
       deallocate(mycube)
       nullify(mycube)

   end do

   10 format('Time elapsed: ',F6.2)
   
end program DoloopsOpenmp

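!Returns the system_clock time in seconds minus lasttime.
!Call it with lasttime = 0.0 to start (tic) and with that result to stop (toc).
real function elapsedtime(lasttime)

   implicit none

   integer          :: count, count_rate
   real, intent(in) :: lasttime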

   call system_clock(count, count_rate)
   
   elapsedtime = count * 1.0 / count_rate - lasttime

end function elapsedtime

  • The module
module moduleDoloopsOpenmp

   use omp_lib

   implicit none

   private

   public :: makeloop

contains

   subroutine makeloop(cubicmatrix)
   
       !Arguments --------------
       integer, dimension(:,:,:), pointer  :: cubicmatrix
   
       !Local variables --------
       integer                             :: i, j, k, lb, ub
       
       !The cube has the same bounds in all three dimensions, so one pair is reused
       lb = lbound(cubicmatrix,3)
       ub = ubound(cubicmatrix,3)
       
       !$OMP PARALLEL PRIVATE(i,j,k)        
       !$OMP DO
       do k = lb, ub
       do j = lb, ub
       do i = lb, ub
   
           cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
   
       end do
       end do
       end do
       !$OMP END DO
       !$OMP END PARALLEL

   end subroutine makeloop
   
end module moduleDoloopsOpenmp
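
The module above corresponds to the "NO CHUNK" runs. The page does not show the directives used for the STATIC and DYNAMIC chunk runs, so the following is only a sketch of how they might be written with the OpenMP SCHEDULE clause. The module and subroutine names are hypothetical, and taking NTHREADS from omp_get_max_threads() is an assumption. It needs an OpenMP-enabled build (for example -fopenmp with gfortran).

```fortran
module scheduleSketch

   use omp_lib

   implicit none

contains

   !STATIC schedule, CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
   subroutine makeloop_static(cube)

       integer, intent(inout) :: cube(:,:,:)
       integer                :: i, j, k, lb, ub, chunk

       lb = lbound(cube,3)
       ub = ubound(cube,3)
       !Assumption: NTHREADS is the maximum team size reported by OpenMP
       chunk = (ub - lb) / omp_get_max_threads() + 1

       !$OMP PARALLEL DO SCHEDULE(STATIC, chunk) PRIVATE(i,j,k)
       do k = lb, ub
       do j = lbound(cube,2), ubound(cube,2)
       do i = lbound(cube,1), ubound(cube,1)
           cube(i,j,k) = cube(i,j,k) + 1
       end do
       end do
       end do
       !$OMP END PARALLEL DO

   end subroutine makeloop_static

   !DYNAMIC schedule, CHUNK = 10
   subroutine makeloop_dynamic(cube)

       integer, intent(inout) :: cube(:,:,:)
       integer                :: i, j, k

       !$OMP PARALLEL DO SCHEDULE(DYNAMIC, 10) PRIVATE(i,j,k)
       do k = lbound(cube,3), ubound(cube,3)
       do j = lbound(cube,2), ubound(cube,2)
       do i = lbound(cube,1), ubound(cube,1)
           cube(i,j,k) = cube(i,j,k) + 1
       end do
       end do
       end do
       !$OMP END PARALLEL DO

   end subroutine makeloop_dynamic

end module scheduleSketch

program scheduleSketchDemo

   use scheduleSketch

   implicit none

   integer, allocatable :: cube(:,:,:)

   allocate(cube(50,50,50))
   cube = 0

   call makeloop_static(cube)
   call makeloop_dynamic(cube)

   !Every element was incremented once by each routine
   write(*,*) 'All elements equal 2: ', all(cube == 2)

end program scheduleSketchDemo
```

Swapping the schedule kind or the chunk expression in the SCHEDULE clause gives the other combinations listed in the results; only the outermost k loop is divided among the threads.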

Results

----------------------------
DO (i,j,k) / NO CHUNK
----------------------------

Table A.1 - Debug do(i,j,k)
Size (M)	Time (s)
100		 0.04
200		 0.37
300		 1.58
400		 7.60
500		19.66
600		41.65

Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k)
Size (M)	Time (s)
100		 0.04
200		 0.37
300		 1.58
400		 7.27
500		19.34
600		41.34

Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k)
Size (M)	Time (s)
100		 0.02
200		 0.19
300		 0.70
400		 1.86
500		 4.05
600		 7.83

----------------------------
DO (k,j,i) / NO CHUNK
----------------------------

Table B.1 - Debug do(k,j,i)
Size (M)	Time (s)
100		 0.04
200		 0.31
300		 1.22
400		 3.36
500		 7.55
600		14.88

Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i)
Size (M)	Time (s)
100		 0.04
200		 0.31
300		 1.21
400		 3.36
500		 7.82
600		15.07

Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size (M)	Time (s)
100		 0.02
200		 0.09
300		 0.36
400		 0.94
500		 2.04
600		 3.89

----------------------------
DO (k,j,i) / STATIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
----------------------------

Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size (M)	Time (s)
100		 0.02
200		 0.15
300		 0.42
400		 1.03
500		 2.12
600		 3.97

----------------------------
DO (k,j,i) / STATIC CHUNK = 10
----------------------------

Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size (M)	Time (s)
100		 0.02
200		 0.16
300		 0.43
400		 1.04
500		 2.18
600		 4.05

----------------------------
DO (k,j,i) / DYNAMIC CHUNK = 10
----------------------------

Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size (M)	Time (s)
100		 0.01
200		 0.10
300		 0.36
400		 0.93
500		 2.01
600		 3.89

----------------------------
DO (k,j,i) / DYNAMIC CHUNK = (UBOUND - LBOUND) / NTHREADS + 1
----------------------------

Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size (M)	Time (s)
100		 0.02
200		 0.09
300		 0.39
400		 1.04
500		 2.14
600		 4.00

Conclusions

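  • do(k,j,i) vs do(i,j,k): about 2 to 3 times faster for sizes of 400 and above (14.88 s vs 41.65 s at size 600 in the serial debug build, 3.89 s vs 7.83 s with OpenMP). Fortran stores arrays in column-major order, so keeping the first index i in the innermost loop gives contiguous memory access.
  • Small dynamic chunks (CHUNK = 10), or no chunk at all, run roughly 3% to 10% faster than large dynamic chunks for sizes 300 to 600 (e.g. 0.93 s vs 1.04 s at size 400). Static chunks of either size were no faster than no chunk, so omitting the chunk is probably the better default.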
  • More test-cases representing different scenarios of do-loops may yield different choices of CHUNK/scheduling.
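
As a starting point for such further test cases, the two loop orders can be timed side by side in one serial program. This is a minimal sketch rather than the code used for the tables above: the program name, the fixed cube size M = 200 and the 64-bit system_clock counters are all assumptions made for illustration.

```fortran
program loopOrderSketch

   use iso_fortran_env, only: int64

   implicit none

   integer, parameter   :: M = 200      !Illustrative cube size (assumption)
   integer, allocatable :: cube(:,:,:)
   integer(int64)       :: t0, t1, t2, rate
   integer              :: i, j, k

   allocate(cube(M,M,M))
   cube = 0

   call system_clock(t0, rate)

   !do(i,j,k): k is innermost, so consecutive accesses are M*M elements apart
   do i = 1, M
   do j = 1, M
   do k = 1, M
       cube(i,j,k) = cube(i,j,k) + 1
   end do
   end do
   end do

   call system_clock(t1)

   !do(k,j,i): i is innermost, so consecutive accesses are contiguous
   do k = 1, M
   do j = 1, M
   do i = 1, M
       cube(i,j,k) = cube(i,j,k) + 1
   end do
   end do
   end do

   call system_clock(t2)

   write(*,'(a,f8.3,a)') 'do(i,j,k): ', real(t1 - t0) / real(rate), ' s'
   write(*,'(a,f8.3,a)') 'do(k,j,i): ', real(t2 - t1) / real(rate), ' s'

   !Each element was incremented twice; printing the sum keeps the loops from being optimised away
   write(*,*) 'Checksum: ', sum(cube)

end program loopOrderSketch
```

The timings above were taken from debug builds. With optimisation enabled a compiler may interchange the loops itself, which can hide the difference, so results from this sketch depend on the build flags.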