Revision as of 15:58, 22 October 2010

What is the best performance that Fortran can give when computing do-loops over large matrices? A single triple do-loop test case was implemented, with the matrix size ranging from 1M to 216M four-byte elements (cubes with an edge of 100 to 600). The test case below shows:

  • that looping in the order (k,j,i), with k outermost and i innermost, is 2 to 4 times faster than looping in the order (i,j,k),
  • that OpenMP directives give a performance increase of about 300% on a quad-core processor (i7-870),
  • that changing the chunk size, or switching between static and dynamic scheduling, changes performance by less than 10%. For this test case, the best performance is obtained with a small dynamic chunk.
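
The loop-order effect in the first bullet can be reproduced in isolation. The sketch below is only an illustration: the module name loopOrderSketch and the subroutines add_one_kji and add_one_ijk are hypothetical and not part of the test program. It relies on the fact that Fortran stores arrays in column-major order, so the first index should vary fastest.

```fortran
module loopOrderSketch

   !Hypothetical helper module, for illustration only
   implicit none
   private
   public :: add_one_kji, add_one_ijk

contains

   !k outermost, i innermost: the innermost loop walks through
   !adjacent memory locations (column-major storage).
   subroutine add_one_kji(cube)
       integer, dimension(:,:,:), intent(inout) :: cube
       integer                                  :: i, j, k
       do k = lbound(cube,3), ubound(cube,3)
       do j = lbound(cube,2), ubound(cube,2)
       do i = lbound(cube,1), ubound(cube,1)
           cube(i,j,k) = cube(i,j,k) + 1
       end do
       end do
       end do
   end subroutine add_one_kji

   !i outermost, k innermost: same result, but the innermost loop
   !jumps through memory with a large stride.
   subroutine add_one_ijk(cube)
       integer, dimension(:,:,:), intent(inout) :: cube
       integer                                  :: i, j, k
       do i = lbound(cube,1), ubound(cube,1)
       do j = lbound(cube,2), ubound(cube,2)
       do k = lbound(cube,3), ubound(cube,3)
           cube(i,j,k) = cube(i,j,k) + 1
       end do
       end do
       end do
   end subroutine add_one_ijk

end module loopOrderSketch
```

Both subroutines produce identical matrices; only the memory access pattern differs, so timing each call on the same large cube should expose the difference.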


Simple triple do-loop


The test program runs a triple do-loop that increments every element of a cubic matrix. The user chooses the cube size through standard input.


  • Intel Core i7-870
  • 8 GB RAM


  • The main program
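program DoloopsOpenmp

   use moduleDoloopsOpenmp, only: makeloop

   implicit none
   integer, dimension(:,:,:), pointer      :: mycube
   integer                                 :: M = 1
   real                                    :: elapsedtime   !external function, defined below
   real                                    :: time = 0.0

   do while (M < 1000)

       write(*,*) 'Insert the cube size M (or insert 1000 to exit): '
       read(*,*) M
       if (M > 999) exit

       !Allocate and initialize an M x M x M cube
       allocate(mycube(M,M,M))
       mycube = 0

       !First call: store the current clock reading as the start time
       time = elapsedtime(0.0)

       call makeloop(mycube)

       !Second call: returns the time elapsed since the start time
       time = elapsedtime(time)
       write(*,10) time

       deallocate(mycube)

   end do

   10 format('Time elapsed: ',F6.2)
end program DoloopsOpenmp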

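!This function returns the current clock reading, in seconds, minus lasttime.
!Called with 0.0 it gives a start time; called with that start time it gives the elapsed time.
real function elapsedtime(lasttime)

   implicit none
   real, intent(in) :: lasttime
   integer          :: count, count_rate

   call system_clock(count, count_rate)
   elapsedtime = count * 1.0 / count_rate - lasttime

end function elapsedtime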

  • The module
module moduleDoloopsOpenmp

   use omp_lib

   implicit none

   private

   public :: makeloop

contains

   subroutine makeloop(cubicmatrix)
       !Arguments --------------
       integer, dimension(:,:,:), pointer  :: cubicmatrix
       !Local variables --------
       integer                             :: i, j, k, lb, ub
       !The matrix is cubic, so the bounds of one dimension serve all three
       lb = lbound(cubicmatrix,3)
       ub = ubound(cubicmatrix,3)
       !$OMP PARALLEL PRIVATE(i,j,k)
       !$OMP DO
       do k = lb, ub
       do j = lb, ub
       do i = lb, ub
           cubicmatrix(i,j,k) = cubicmatrix(i,j,k) + 1
       end do
       end do
       end do
       !$OMP END DO
       !$OMP END PARALLEL

   end subroutine makeloop
end module moduleDoloopsOpenmp


  • Full results

  • Looking only at the results with STATIC/DYNAMIC/CHUNK variations. Size is the cube size M and Time is the elapsed time in seconds, as printed by the test program.
DO (i,j,k) / NO CHUNK
Table A.1 - Debug do(i,j,k)
Size	Time
100		 0.04
200		 0.37
300		 1.58
400		 7.60
500		19.66
600		41.65

Table A.2 - Debug openmp without !$OMP PARALLEL directives do(i,j,k)
Size	Time
100		 0.04
200		 0.37
300		 1.58
400		 7.27
500		19.34
600		41.34

Table A.3 - Debug openmp with one !$OMP PARALLEL DO directive do(i,j,k)
Size	Time
100		 0.02
200		 0.19
300		 0.70
400		 1.86
500		 4.05
600		 7.83

DO (k,j,i) / NO CHUNK

Table B.1 - Debug do(k,j,i)
Size	Time
100		 0.04
200		 0.31
300		 1.22
400		 3.36
500		 7.55
600		14.88

Table B.2 - Debug openmp without !$OMP PARALLEL directives do(k,j,i)
Size	Time
100		 0.04
200		 0.31
300		 1.21
400		 3.36
500		 7.82
600		15.07

Table B.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size	Time
100		 0.02
200		 0.09
300		 0.36
400		 0.94
500		 2.04
600		 3.89


Table C.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size	Time
100		 0.02
200		 0.15
300		 0.42
400		 1.03
500		 2.12
600		 3.97

DO (k,j,i) / STATIC CHUNK = 10

Table D.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size	Time
100		 0.02
200		 0.16
300		 0.43
400		 1.04
500		 2.18
600		 4.05

DO (k,j,i) / DYNAMIC CHUNK = 10

Table E.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size	Time
100		 0.01
200		 0.10
300		 0.36
400		 0.93
500		 2.01
600		 3.89


Table F.3 - Debug openmp with one !$OMP PARALLEL DO directive do(k,j,i)
Size	Time
100		 0.02
200		 0.09
300		 0.39
400		 1.04
500		 2.14
600		 4.00


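  • do(k,j,i) vs. do(i,j,k): 2 to 4 times faster (about 2.8 times at size 600 in Tables A.1 and B.1). Fortran stores arrays in column-major order, so making the first index i the innermost loop accesses memory contiguously.
  • Small dynamic chunks, or no chunk at all, yield up to about 10% better performance than large dynamic chunks. It is probably best to specify no chunk.
  • More test cases representing different do-loop scenarios may lead to different choices of chunk size and scheduling.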
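
The STATIC/DYNAMIC/CHUNK variations in the tables are not shown in the module listing. A minimal sketch of how they could be written with the OpenMP SCHEDULE clause follows; the module name scheduleSketch, the subroutine names and the chunk argument are assumptions made for this illustration, not the code that produced the tables.

```fortran
module scheduleSketch

   !Hypothetical helper module, for illustration only
   implicit none
   private
   public :: makeloop_static, makeloop_dynamic

contains

   !Iterations of the k loop are dealt out to the threads in
   !fixed blocks of c iterations, decided before the loop starts.
   subroutine makeloop_static(cube, chunk)
       integer, dimension(:,:,:), intent(inout) :: cube
       integer, intent(in)                      :: chunk
       integer                                  :: i, j, k, c
       c = max(1, chunk)   !the chunk size must be positive
       !$OMP PARALLEL DO SCHEDULE(STATIC, c) PRIVATE(i,j,k)
       do k = lbound(cube,3), ubound(cube,3)
       do j = lbound(cube,2), ubound(cube,2)
       do i = lbound(cube,1), ubound(cube,1)
           cube(i,j,k) = cube(i,j,k) + 1
       end do
       end do
       end do
       !$OMP END PARALLEL DO
   end subroutine makeloop_static

   !Each thread grabs the next block of c iterations of the k loop
   !whenever it becomes idle.
   subroutine makeloop_dynamic(cube, chunk)
       integer, dimension(:,:,:), intent(inout) :: cube
       integer, intent(in)                      :: chunk
       integer                                  :: i, j, k, c
       c = max(1, chunk)   !the chunk size must be positive
       !$OMP PARALLEL DO SCHEDULE(DYNAMIC, c) PRIVATE(i,j,k)
       do k = lbound(cube,3), ubound(cube,3)
       do j = lbound(cube,2), ubound(cube,2)
       do i = lbound(cube,1), ubound(cube,1)
           cube(i,j,k) = cube(i,j,k) + 1
       end do
       end do
       end do
       !$OMP END PARALLEL DO
   end subroutine makeloop_dynamic

end module scheduleSketch
```

To compare schedules without recompiling, SCHEDULE(RUNTIME) can be used instead, with the choice set through the OMP_SCHEDULE environment variable (for example "dynamic,10").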



SetMatrix3D_Constant


This subroutine belongs to ModuleFunctions in MohidWater. Within MohidWater, parallelizing this subroutine degrades performance. In the small test program, however, the same OpenMP directives give quite good results.


  • Core i7-870
  • 8 GB RAM


   real function SetMatrixValues3D_R8_Constant (Matrix, Valueb, MapMatrix)

       !Arguments
       real, dimension(:, :, :), pointer               :: Matrix
       real, intent (IN)                               :: Valueb
       integer, dimension(:, :, :), pointer, optional  :: MapMatrix

       !Local variables
       integer                                         :: i, j, k
       integer                                         :: ilb, iub, jlb, jub, klb, kub

       ilb = lbound(Matrix,1)
       iub = ubound(Matrix,1)
       jlb = lbound(Matrix,2)
       jub = ubound(Matrix,2)
       klb = lbound(Matrix,3)
       kub = ubound(Matrix,3)

       !griflet: omp slowdown
       if (present(MapMatrix)) then
           !Set only the cells flagged with 1 in MapMatrix
           !$OMP PARALLEL DO PRIVATE(i,j,k)
           do k = klb, kub
           do j = jlb, jub
           do i = ilb, iub
               if (MapMatrix(i, j, k) == 1) then
                   Matrix (i, j, k) = Valueb
               end if
           end do
           end do
           end do
           !$OMP END PARALLEL DO
       else
           !Set every cell
           !$OMP PARALLEL DO PRIVATE(i,j,k)
           do k = klb, kub
           do j = jlb, jub
           do i = ilb, iub
               Matrix (i, j, k) = Valueb
           end do
           end do
           end do
           !$OMP END PARALLEL DO
       end if

       SetMatrixValues3D_R8_Constant = sumMatrix3D(Matrix)

   end function SetMatrixValues3D_R8_Constant
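
The function above returns sumMatrix3D(Matrix), which is not listed on this page. The sketch below assumes that sumMatrix3D simply adds up all the elements of the matrix; the module name sumMatrixSketch and the body are hypothetical, written only to show how such a sum could be parallelized with an OpenMP REDUCTION clause.

```fortran
module sumMatrixSketch

   !Hypothetical stand-in for the helper used by the test program
   implicit none
   private
   public :: sumMatrix3D

contains

   real function sumMatrix3D(Matrix)

       real, dimension(:, :, :), pointer   :: Matrix
       integer                             :: i, j, k
       real                                :: total

       total = 0.0

       !Each thread accumulates its own partial sum; OpenMP adds the
       !partial sums into total at the end of the loop.
       !$OMP PARALLEL DO PRIVATE(i,j,k) REDUCTION(+:total)
       do k = lbound(Matrix,3), ubound(Matrix,3)
       do j = lbound(Matrix,2), ubound(Matrix,2)
       do i = lbound(Matrix,1), ubound(Matrix,1)
           total = total + Matrix(i, j, k)
       end do
       end do
       end do
       !$OMP END PARALLEL DO

       sumMatrix3D = total

   end function sumMatrix3D

end module sumMatrixSketch
```

Without the REDUCTION clause, the threads would update total concurrently and the result would be unreliable. With it, the sum may differ slightly from the serial one in the last digits, because floating-point additions are performed in a different order.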
