Conversion between character and number
character → number

```
character(len=2) :: ca
integer :: a
ca='10'
read(ca,*) a
```

number → character
```
character(len=2) :: ca
integer :: a
a=10
write(ca,*) a
```
Write a string without specifying the number of characters (no trailing spaces)
```
character(len=100) :: ca
ca='abcdefg'
write(*,'(a)') trim(ca)
```
Write out an integer with the minimum number of digits (no leading spaces)
```
integer :: i
i=10000
write(*,'(i0)') i
```
zero fill when writing out an integer
Display in a field of width 7, zero-filled to 6 digits:

```
integer :: i
i=12345
write(*,'(i7.6)') i
```
Automatically change character count
With an allocatable character, the length is updated automatically whenever the string is reassigned:

```
character(:),allocatable :: ca
ca='abcdefg'
write(*,*) len(ca)
ca='abcdefghijklmn'
write(*,*) len(ca)
```
Append to a file that has already been closed
Specify position='append' in the open statement. However, this is only valid for formatted files.
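For example, a minimal sketch (the file name is illustrative):

```
open(10,file='hoge.dat',position='append')
write(10,*) 'additional line'
close(10)
```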
Maximum and minimum values for numbers

type | minimum | maximum | significant digits |
---|---|---|---|
integer(1) | -128 | 127 | |
integer(2) | -32768 | 32767 | |
integer(4) | -2147483648 | 2147483647 | |
integer(8) | -9223372036854775808 | 9223372036854775807 | |
real(4) | 1.17549435e-38 | 3.40282347e38 | 8 |
real(8) | 2.2250738585072014d-308 | 1.7976931348623157d308 | 16 |
Debugging options
No optimization at all:

```
ifort -O0 hoge.f90
```

Check for segmentation faults due to out-of-bounds array access or other memory misuse:
```
ifort -CB -traceback hoge.f90
```

Check for zero division or value overflow:
```
ifort -fpe0 -traceback hoge.f90
```

When executed with the above options, a detailed error message is displayed and the calculation stops.
Optimization options
Automatic vectorization:

```
ifort -xHost hoge.f90
```

Compile while inlining all subroutine calls (interprocedural optimization):
```
ifort -ipo hoge.f90
```

If you have a very large number of subroutines, you can hope for some speedup. (Note: the -ipo option is not recommended. The order in which subroutines are executed may change internally and you may not get the desired result. It is true that -ipo sometimes makes things faster, but it is better to give up than to spend extra time suffering from -ipo bugs.)
```
ifort -O1 hoge.f90
ifort -O2 hoge.f90
ifort -O3 hoge.f90
```

The optimization level increases in the order O1 → O2 → O3, but O3 is not always faster than O2 (it depends on how well the program is written). Also, depending on how the program is written, raising the optimization level can introduce bugs. The default is O2.
Faster division
```
ifort -no-prec-div hoge.f90
```

Division takes far longer than addition, subtraction, and multiplication, so this option speeds up division by reducing its precision. (It is better to reduce the number of divisions in the first place: for example, instead of dividing by 2, multiply by 0.5, or use a bit shift for integers.)
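A minimal sketch of the replacements just mentioned (the values are illustrative):

```
real(8) :: x,y
integer :: i,j
x=3.d0
y=x*0.5d0      ! instead of y=x/2.d0
i=10
j=ishft(i,-1)  ! divide a non-negative integer by 2 with a right shift, instead of j=i/2
```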
In Fortran, when an array of two or more dimensions is allocated, memory is laid out with the first (innermost) index varying fastest. For example, for
```
real(8) :: a(3,3)
```

the actual order of the allocated memory is
```
a(1,1) a(2,1) a(3,1) a(1,2) a(2,2) a(3,2) a(1,3) a(2,3) a(3,3)
```

(the same as Julia, but the opposite of C/C++, Java, Python, etc.). If you manipulate an array with do loops, it is faster to traverse it in memory order. For example,
```
do i=1,3
  do j=1,3
    a(i,j)=i+j
  enddo
enddo
```

and
```
do j=1,3
  do i=1,3
    a(i,j)=i+j
  enddo
enddo
```

perform the same operations, but the latter is faster because it accesses the array in memory order.
Step function \( \theta(x) \)
```
0.5d0*(dsign(1.d0,x)+1.d0)
```

However, there are many cases where an if statement is faster. The only advantage may be that it can be written neatly on a single line.
Built-in function atan2(y,x)=atan(y/x)
Note that the argument order is atan2(y,x), not atan2(x,y) (the same as gnuplot, the opposite of Mathematica).
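A quick check of the argument order (the values are chosen only for illustration):

```
write(*,*) atan2(1.d0,0.d0)   ! pi/2: the point (x,y)=(0,1) lies on the +y axis
write(*,*) atan2(0.d0,1.d0)   ! 0:    the point (x,y)=(1,0) lies on the +x axis
```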
Writing and reading data is often the bottleneck of a numerical calculation. For example, instead of writing array data out one element at a time like

```
open(10,file='hoge.dat')
do j=1,3
  do i=1,3
    write(10,*) a(i,j)
  enddo
enddo
close(10)
```
it is faster to write the data out together in a single statement:

```
open(10,file='hoge.dat')
write(10,*) ((a(i,j),i=1,3),j=1,3)
close(10)
```
There are two types of data files: human-readable (ASCII) and machine-readable (binary). If the data does not need to be read by humans (e.g., it is just passed to another program), writing it out in binary format gives a surprisingly small writing time and file size (binary is also much faster to read). Incidentally, AVS/Express and Python's NumPy can read binary data files. It seems that, with some training, gnuplot can also read binary data files. Paraview, however, requires the awkward trick of mixing in ASCII when reading binary data. Example of writing data in binary format:
```
open(10,file='hoge',form='binary')
write(10) ((a(i,j),i=1,3),j=1,3)
close(10)
```

It is better not to add an extension, or if you do, use an extension other than dat (e.g. bin).
About unformatted, binary
unformatted : When used with access='sequential' (the default), a 4-byte header and footer are added every time data is written with a write statement. It is somewhat similar to the formatted format, where a newline code is inserted by every write statement. It is useless.
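As a sketch of what this means, the following writes a single real(8) value and then checks the file size, which comes out larger than 8 bytes because of the record markers (the file name is illustrative):

```
program recmarker
  implicit none
  integer :: fsize
  real(8) :: x
  x=1.d0
  open(10,file='seq.bin',form='unformatted')  ! access='sequential' is the default
  write(10) x                                 ! header + 8 bytes of data + footer
  close(10)
  inquire(file='seq.bin',size=fsize)
  write(*,*) 'file size in bytes:',fsize      ! typically 16 = 4 + 8 + 4
end program recmarker
```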
By the way, if you want to avoid line feed codes in the formatted format
```
write(10,*,advance='no')
```

binary : Easy to use because neither header nor footer is added by the write statement. However, it is not implemented except in Intel compilers.
If you must use unformatted with a compiler other than Intel's, use access='direct' with recl=(record length in bytes), or access='stream'. In the former case, each record consists of the number of bytes specified by recl, and
```
write(10,rec=5)
```

writes to the position of the 5th record (each record being recl bytes) from the top of the file. If you do not want the concept of records,
```
write(10,pos=5)
```

with stream access writes starting at the 5th byte from the top of the file. Incidentally, in the formatted case the position is counted in characters instead of bytes (including line breaks).
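A minimal sketch of the two alternatives (file names and values are illustrative; note that the unit of recl for unformatted direct access may be compiler dependent, e.g. 4-byte words with ifort unless -assume byterecl is given):

```
real(8) :: x
x=1.d0
! direct access: fixed-length records addressed with rec=
open(10,file='direct.bin',form='unformatted',access='direct',recl=8)
write(10,rec=5) x            ! writes the 5th record
close(10)
! stream access: no records, addressed by byte position with pos=
open(11,file='stream.bin',form='unformatted',access='stream')
write(11,pos=9) x            ! writes starting at the 9th byte
close(11)
```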
Since computers can only handle numbers with a finite number of digits, significant digits can be lost in addition and subtraction. For example, (a+b)+c and a+(b+c) are not necessarily equal. The loss of digits becomes serious when a and b are close to each other and a-b is computed. Two examples of problems are shown below. The first is the variance: computing it as the mean of the squares minus the square of the mean subtracts two nearly equal numbers when the mean is large compared to the spread.
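A minimal sketch of that cancellation (the numbers are illustrative): the naive one-pass formula loses almost all digits, while subtracting the mean first does not.

```
program variance_cancel
  implicit none
  integer,parameter :: n=1000
  integer :: i
  real(8) :: x(n),mean,var_naive,var_twopass
  ! numbers with a large mean (1e8) and a small spread (order 1)
  do i=1,n
    x(i)=1.d8+dble(i)/dble(n)
  enddo
  mean=sum(x)/dble(n)
  var_naive=sum(x**2)/dble(n)-mean**2      ! subtracts two nearly equal huge numbers
  var_twopass=sum((x-mean)**2)/dble(n)     ! subtracts first, then squares
  write(*,*) var_naive,var_twopass
end program variance_cancel
```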
Do not use strides in numerical calculations or when reading/writing files. For example, when using a Fourier transform library to perform multiple Fourier transforms, rather than using a stride like
```
program main
  use mkl_dfti
  implicit none
  type(dfti_descriptor),pointer :: plandft
  integer,parameter :: n=1024
  integer :: statdft
  complex(8) :: zf(n*5)
  statdft=dfticreatedescriptor(plandft,dfti_double,dfti_complex,1,n)
  statdft=dftisetvalue(plandft,dfti_number_of_transforms,5)
  statdft=dftisetvalue(plandft,dfti_input_distance,1)
  statdft=dftisetvalue(plandft,dfti_output_distance,1)
  statdft=dftisetvalue(plandft,dfti_input_strides,(/0,5/))
  statdft=dftisetvalue(plandft,dfti_output_strides,(/0,5/))
  statdft=dfticommitdescriptor(plandft)
  statdft=dfticomputeforward(plandft,zf)
  stop
end program main
```

it is better to read the data as a single contiguous chunk for each transform, as in
```
program main
  use mkl_dfti
  implicit none
  type(dfti_descriptor),pointer :: plandft
  integer,parameter :: n=1024
  integer :: statdft
  complex(8) :: zf(n*5)
  statdft=dfticreatedescriptor(plandft,dfti_double,dfti_complex,1,n)
  statdft=dftisetvalue(plandft,dfti_number_of_transforms,5)
  statdft=dftisetvalue(plandft,dfti_input_distance,n)
  statdft=dftisetvalue(plandft,dfti_output_distance,n)
  statdft=dfticommitdescriptor(plandft)
  statdft=dfticomputeforward(plandft,zf)
  stop
end program main
```

(although a stride is necessary if you want to Fourier transform only the 2nd or 3rd dimension). Similarly, when reading data in AVS/Express, etc., do not use strides; read the data as a single chunk.
Under OpenPBS, the number of cores specified in ppn can be given to OpenMP automatically with
```
export OMP_NUM_THREADS=$PBS_NUM_PPN
```
Variables used in a job script can be supplied with the -v option of qsub. For example, if the script contains
```
#PBS -o out-$test1-$test2.log
./$out.exe $test1 $test2
```

and so on, then
```
qsub ./hoge.sh -v out=test,test1=1,test2=2
```

causes
```
./test.exe 1 2
```

to be executed, and a log file named out-1-2.log is created.
Parallelization of do statements
A completely independent do loop can be parallelized easily:

```
real(8) :: a(1:100),b(1:100),c(1:100)
! (omitted)
!$OMP PARALLEL DO
do i=1,100
  a(i)=b(i)*c(i)
enddo
!$OMP END PARALLEL DO
```

The do loop between !$OMP PARALLEL DO and !$OMP END PARALLEL DO is parallelized. For example, with four cores, 1≤i≤25 goes to the first core, 26≤i≤50 to the second, 51≤i≤75 to the third, and 76≤i≤100 to the fourth; finally, at !$OMP END PARALLEL DO, the parts of a computed by each core are merged.
```
real(8) :: a(1:100,1:100),b(1:100,1:100),c(1:100,1:100)
! (omitted)
!$OMP PARALLEL DO
do j=1,100
  do i=1,100
    a(i,j)=b(i,j)*c(i,j)
  enddo
enddo
!$OMP END PARALLEL DO
```

If the do statements are nested, the outer do statement (j in this case) is parallelized.
On the other hand, in a loop with a dependency like

```
real(8) :: a(1:100),b(1:99)
! (omitted)
!$OMP PARALLEL DO
do i=1,99
  a(i+1)=a(i)+b(i)
enddo
!$OMP END PARALLEL DO
```

the result a(i) computed at the previous i is needed to compute a(i+1). With 4-core parallelism, for example, a(26) needs the result of a(25), but a(26) is computed by the second core while a(25) is still being computed by the first core, so a(26) does not come out correctly (and if a(26) is wrong, everything after it is wrong too).
Similarly, with a sum like

```
real(8) :: s,a(1:100)
! (omitted)
s=0.d0
!$OMP PARALLEL DO
do i=1,100
  s=s+a(i)
enddo
!$OMP END PARALLEL DO
```

each core adds up a(i) over its own range of i independently, but the updates to the single variable s from the different cores conflict and cannot be combined correctly, so a strange result is obtained. In this case, use an array expression as in the example below, or use ATOMIC:
```
real(8) :: s,a(1:100)
! (omitted)
s=0.d0
!$OMP PARALLEL DO
do i=1,100
  !$OMP ATOMIC
  s=s+a(i)
enddo
!$OMP END PARALLEL DO
```

!$OMP ATOMIC applies to the next line and makes the modification of the shared memory atomic. The use of array expressions is recommended for speed and stability; ATOMIC should be used only when array expressions cannot be used.
Parallelization of array expressions
Array expressions can also be parallelized easily. For example,

```
real(8) :: a(1:100),b(1:100),c(1:100)
! (omitted)
!$OMP WORKSHARE
a(1:100)=b(1:100)+c(1:100)
!$OMP END WORKSHARE
```

Array expressions between !$OMP WORKSHARE and !$OMP END WORKSHARE are parallelized. Array functions can also be parallelized, e.g.
```
real(8) :: s,a(1:100)
! (omitted)
!$OMP WORKSHARE
s=sum(a(1:100))
!$OMP END WORKSHARE
```

Each core computes the sum over a different range of a, and the total is obtained by adding the partial sums together at !$OMP END WORKSHARE.
To compile a program using OpenMP, use the compile option -openmp.
```
ifort -openmp hoge.f90
```

Also, the number of cores must be given via the environment variable OMP_NUM_THREADS for the computation to actually run in parallel. An example job script is
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
./out.exe
```

Specify the number of parallel cores both in ppn (the #PBS -l nodes line) and in the environment variable OMP_NUM_THREADS (the export line). (Caution!) When used together with ulimit, the environment variable must be set before ulimit to work properly. In other words,
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
ulimit -s unlimited
export OMP_NUM_THREADS=4
./out.exe
```

is not good enough;
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
ulimit -s unlimited
./out.exe
```

is what it must be. As mentioned above, the ppn value can also be picked up automatically:
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=$PBS_NUM_PPN
./out.exe
```

If the program terminates with a segmentation violation only when run with OpenMP parallelism, the stack area may be too small. Use the environment variable OMP_STACKSIZE,
```
export OMP_STACKSIZE=(stack size in KB)
```

to increase the stack area. Try starting with about 512 MB (512000), depending on the program size.
Parallelization of the random number generation library mkl_vsl using OpenMP
Parallelizing the do loop that calls the generator one random number at a time is very slow. Here we instead create one random number stream (seed) per thread and parallelize by hand.

```
program main
  use mkl_vsl_type
  use mkl_vsl
  use omp_lib
!
  implicit none
  integer,parameter :: n=1000
  integer :: iprocs,nprocs,nprocsm
  integer,allocatable :: nru(:),nrd(:),nrp(:)
  integer :: statvsl,irand,cr,cm
  type(vsl_stream_state),allocatable :: planvsl(:)
  real(8) :: fake(1),frand(0:n-1)
!
  !$OMP PARALLEL
  nprocs=omp_get_max_threads()
  !$OMP END PARALLEL
!
  if(mod(n,nprocs)==0) then
    nprocsm=nprocs-1
    allocate(nru(0:nprocsm),nrd(0:nprocsm),nrp(0:nprocsm))
    do iprocs=0,nprocsm
      nrp(iprocs)=n/nprocs
      nrd(iprocs)=nrp(iprocs)*iprocs
      nru(iprocs)=nrp(iprocs)*(iprocs+1)-1
    enddo
  else
    nprocsm=n/(n/nprocs+1)
    allocate(nru(0:nprocsm),nrd(0:nprocsm),nrp(0:nprocsm))
    do iprocs=0,nprocsm
      nrp(iprocs)=n/nprocs+1
      nrd(iprocs)=nrp(iprocs)*iprocs
      nru(iprocs)=nrp(iprocs)*(iprocs+1)-1
    enddo
    nrp(nprocsm)=n-nrd(nprocsm)
    nru(nprocsm)=n-1
  endif
!
  allocate(planvsl(0:nprocsm))
  do iprocs=0,nprocsm
    call system_clock(irand,cr,cm)
    statvsl=vslnewstream(planvsl(iprocs),vsl_brng_mt19937,irand+iprocs)
    statvsl=vdrnggaussian(vsl_method_dgaussian_icdf,planvsl(iprocs),1,fake,0.d0,1.d0)
  enddo
!
  !$OMP PARALLEL
  iprocs=omp_get_thread_num()
  if(iprocs<=nprocsm) then
    statvsl=vdrnguniform(vsl_method_duniform_std,planvsl(iprocs),nrp(iprocs),frand(nrd(iprocs)),0.d0,1.d0)
  endif
  !$OMP END PARALLEL
  stop
end program main
```

The iprocs-th thread generates nrp(iprocs) random numbers, filling the region from frand(nrd(iprocs)) to frand(nru(iprocs)). The disadvantage of this program is that OMP_NUM_THREADS=1 must be set explicitly even when parallelization is not used.
In MPI parallel computation, all cores execute the same program. The simplest example of an MPI program is shown below.
```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  write(*,*) myrank,nprocs
!
  call mpi_finalize(ierr)
  stop
end program main
```

On line 3, the file defining the set of functions needed for MPI parallelism is read; MPI parallel code cannot be executed without reading this file. If you want to run MPI in parallel on another system, you may need to change the path in this include line, because the location of mpif.h differs from system to system.
ierr : an integer that stores the error status.
nprocs : an integer that stores the total number of MPI processes.
myrank : an integer that stores this process's own rank among all MPI processes.
Line 6 declares the start of MPI parallelism. Everything from here until mpi_finalize on line 12 is executed in parallel.
On line 7, the MPI function mpi_comm_size stores the total number of processes in nprocs. mpi_comm_world is a variable called an MPI communicator, which can be thought of as a kind of boilerplate incantation.
On line 8, the MPI function mpi_comm_rank stores this process's own rank in myrank. In other words, myrank takes a different value on each core and is the most important value for parallel computation. 0 ≤ myrank ≤ nprocs-1.
The end of MPI parallelism is declared by mpi_finalize on line 12. Execution proceeds to the next line only after all cores have finished their calculations. Without this line, the first core to finish would execute the stop on line 13 and the program would terminate even though the other cores are still in the middle of their calculations, so this line is very important.
Running this with 4 processes prints

```
0 4
1 4
2 4
3 4
```

(the order of the lines may vary). To compile a program written for MPI parallel computation, use mpif90 instead of ifort (or mpif77 if you are writing in FORTRAN 77). An example of a job script for MPI parallel computation is
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=4:ppn=1
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
mpirun ./out.exe
```

Use mpirun for execution. Note the difference between nodes and ppn: ppn specifies the degree of parallelism within a node, while nodes specifies the degree of parallelism across nodes. For example, with nodes=1:ppn=4, maui finds the fastest node at that moment and runs an intra-node parallel computation with 4 processes on it. Conversely, with nodes=4:ppn=1, maui assigns the job to nodes one process at a time, measuring the speed of each node as needed; once the allocation of cores is decided, if all cores end up on the same node it becomes intra-node parallelism (i.e., almost the same as nodes=1:ppn=4), and if they are spread over several nodes it becomes inter-node parallelism. The total number of processes = nodes × ppn. For example, nodes=2:ppn=4 gives an 8-process computation, as if two nodes each ran 4 intra-node processes (with nodes=2, if maui assigns both sets to the same node, it becomes 8-process intra-node parallelism). Since inter-node parallelism exchanges data over LAN cables, the more data communication there is, the worse the parallel efficiency becomes.
Parallelization of completely independent do statements
This is what is called "idiot parallelism" (embarrassingly parallel computation), i.e., parallelization that could just as well be replaced by submitting multiple independent jobs. For example, if you want to perform exactly the same calculation with different parameters:

```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
!
  integer :: it
  real(8) :: t
  ! (define other variables)
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  do it=myrank,100,nprocs
    t=dble(it)/100.d0
    ! (calculation)
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```

This is the simplest way of parallelizing over the parameter when the calculation is performed for 0≤t≤1 in steps of 0.01. The caveat with this parallelization is that the 101 values of t are divided equally among the cores, so the fastest core finishes first and then waits forever while the slowest core keeps computing forever, especially with inter-node parallelism. To avoid this, the following improvement can be considered.
```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
  character(len=20) :: crank
!
  integer :: it
  real(8) :: t
  ! (define other variables)
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  it=myrank
  if(myrank==0) then
    open(10,file='next',form='binary')
    write(10) nprocs
    close(10)
  endif
  write(crank,*) myrank
  call system('sleep '//crank)
!
  do
    t=dble(it)/100.d0
    ! (calculation)
    open(10,file='next',form='binary',status='old')
    read(10) it
    close(10)
    if(it>100) exit
    open(10,file='next',form='binary')
    write(10) it+1
    close(10)
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```

Here, whichever core finishes the calculation for its current t takes the next t in turn. A core that reaches the end of the do loop reads the file "next" to learn the next value of it and overwrites "next" with it+1. To keep several cores from reading and writing "next" at the same moment, each core sleeps for myrank seconds just before the do loop, so the timing of the accesses is staggered slightly from core to core.
When you want to accumulate an ensemble in a Monte Carlo calculation, etc., you can parallelize over the ensemble members. In this case, the values calculated by each core must be added up at the end in order to compute averages and so on. Rewriting the previous example one step further:
```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
  character(len=20) :: crank
!
  integer :: it
  real(8) :: t,s,st
  ! (define other variables)
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  it=myrank
  if(myrank==0) then
    open(10,file='next.bin',form='binary')
    write(10) nprocs
    close(10)
  endif
  write(crank,*) myrank
  call system('sleep '//crank)
!
  s=0.d0
  do
    ! (calculation)
    s=s+st
    !
    open(10,file='next.bin',form='binary',status='old')
    read(10) it
    close(10)
    if(it>100) exit
    open(10,file='next.bin',form='binary')
    write(10) it+1
    close(10)
  enddo
  call mpi_allreduce(mpi_in_place,s,1,mpi_double_precision,mpi_sum,mpi_comm_world,ierr)
!
  if(myrank==0) then
    open(11,file='average.dat',form='formatted')
    write(11,'(e26.16e3)') s/101.d0
    close(11)
  endif
!
  call mpi_finalize(ierr)
  stop
end program main
```

This computes the 101 ensemble members in parallel and accumulates the per-member value st into s on each core. Finally, the s of every core has to be added together, and this is where data communication is performed. The command for this is mpi_allreduce, which is used in the form
```
call mpi_allreduce(variable to send, variable to store the result, number of data items, data type, reduction operation, MPI communicator, error status)
```

If you want the variable you send and the variable that stores the result to be the same, pass mpi_in_place as the variable to send. In this way, the sum of s over all nodes is computed and stored back into s on every node. The data to be communicated may also be an array; in that case, pass the first element of the array and the number of elements to communicate. Data types include mpi_integer, mpi_double_precision, and mpi_double_complex. In the example above, the data of all cores are added together during the communication, which is what mpi_sum specifies. Various other operations can be specified, such as taking the product with mpi_prod. If an array is given, the sum or product is taken element by element. See the manual for details.
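For reference, reducing an array works the same way; a minimal sketch with placeholder names arr and n (arr being a real(8) array of length n that exists on every rank):

```
call mpi_allreduce(mpi_in_place,arr,n,mpi_double_precision,mpi_sum,mpi_comm_world,ierr)
! element-wise sum over all ranks; the result is stored back into arr on every rank
```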
Basically, a communication command is executed once all cores have reached it (i.e., the faster cores wait there), but if you want to synchronize the cores deliberately (keep them in step), you can use

```
call mpi_barrier(mpi_comm_world,ierr)
```
The following is an example of halo exchange with mpi_sendrecv: a 2D array is distributed over the y direction, each rank exchanges its boundary rows with its neighbors, and then a Laplacian-like finite difference is computed.

```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
  integer :: istatus(mpi_status_size)
  integer :: kup,kdown
!
  integer,parameter :: nx=100,ny=100
  integer,parameter :: nxm=nx-1,nym=ny-1
  integer :: ix,iy
  integer :: nyi,nyf,nyr
  real(8),allocatable :: f(:,:),fd(:,:)
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  kup=myrank+1
  kdown=myrank-1
  if(myrank==0) kdown=mpi_proc_null
  if(myrank==nprocs-1) kup=mpi_proc_null
!
  if(mod(ny,nprocs)/=0) stop
  nyr=ny/nprocs
  nyi=myrank*nyr
  nyf=(myrank+1)*nyr-1
!
  allocate(f(-1:nx,nyi-1:nyf+1),fd(0:nxm,nyi:nyf))
!
  do iy=nyi,nyf
    do ix=0,nxm
      f(ix,iy)=... ! assign value to f
    enddo
  enddo
!
  do iy=nyi,nyf
    f(-1,iy)=0.d0
    f(nx,iy)=0.d0
  enddo
  if(myrank==0) then
    do ix=0,nxm
      f(ix,-1)=0.d0
    enddo
  else if(myrank==nprocs-1) then
    do ix=0,nxm
      f(ix,ny)=0.d0
    enddo
  endif
!
  call mpi_sendrecv(f(0,nyf),nx,mpi_double_precision,kup,1,f(0,nyi-1),nx,mpi_double_precision,kdown,1,mpi_comm_world,istatus,ierr)
  call mpi_sendrecv(f(0,nyi),nx,mpi_double_precision,kdown,1,f(0,nyf+1),nx,mpi_double_precision,kup,1,mpi_comm_world,istatus,ierr)
!
  do iy=nyi,nyf
    do ix=0,nxm
      fd(ix,iy)=f(ix+1,iy)+f(ix-1,iy)+f(ix,iy+1)+f(ix,iy-1)-4.d0*f(ix,iy)
    enddo
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```
The next example is a 3D FFT distributed over the z direction: each rank performs 2D FFTs on its own xy planes, the data are transposed across ranks with mpi_isend/mpi_irecv, 1D FFTs are applied along the remaining direction, and the data are transposed back.

```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,ireq,nprocs,nprocsm,myrank
  integer :: istatus(mpi_status_size)
!
  integer,parameter :: nx=512
  integer,parameter :: nxm=nx-1
  integer :: ix,iy,iz,iranky,irankz
  integer :: nzi,nzf,nzr
  integer,allocatable :: nzio(:),nzfo(:)
  complex(8),allocatable :: zf(:,:,:),zft(:,:,:),zf1(:)
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
  nprocsm=nprocs-1
!
  if(mod(nx,nprocs)/=0) stop
  nzr=nx/nprocs
  nzi=myrank*nzr
  nzf=(myrank+1)*nzr-1
!
  allocate(nzio(0:nprocsm),nzfo(0:nprocsm))
  do irankz=0,nprocsm
    nzio(irankz)=irankz*nzr
    nzfo(irankz)=(irankz+1)*nzr-1
  enddo
!
  allocate(zf(0:nxm,0:nxm,nzi:nzf),zft(0:nxm,nzi:nzf,0:nxm),zf1(0:nxm))
!
  do iz=nzi,nzf
    do iy=0,nxm
      do ix=0,nxm
        zf(ix,iy,iz)=... ! assign value to zf
      enddo
    enddo
  enddo
!
  do iz=nzi,nzf
    call forwardfft2d(zf(0,0,iz)) ! 2D Fourier transform on a single node
  enddo
!
  do irankz=0,nprocsm
    do iranky=0,nprocsm
      if(irankz==myrank) then
        if(iranky==myrank) then
          do iz=nzi,nzf
            do iy=nzi,nzf
              do ix=0,nxm
                zft(ix,iy,iz)=zf(ix,iy,iz)
              enddo
            enddo
          enddo
        else
          do iz=nzi,nzf
            do iy=nzio(iranky),nzfo(iranky)
              call mpi_isend(zf(0,iy,iz),nx,mpi_double_complex,iranky,1,mpi_comm_world,ireq,ierr)
              call mpi_wait(ireq,istatus,ierr)
            enddo
          enddo
        endif
      else if(iranky==myrank) then
        do iz=nzio(irankz),nzfo(irankz)
          do iy=nzi,nzf
            call mpi_irecv(zft(0,iy,iz),nx,mpi_double_complex,irankz,1,mpi_comm_world,ireq,ierr) ! receive from rank irankz
            call mpi_wait(ireq,istatus,ierr)
          enddo
        enddo
      endif
    enddo
  enddo
!
  do iy=nzi,nzf
    do ix=0,nxm
      do iz=0,nxm
        zf1(iz)=zft(ix,iy,iz)
      enddo
      call forwardfft1d(zf1) ! 1D Fourier transform on a single node
      do iz=0,nxm
        zft(ix,iy,iz)=zf1(iz)
      enddo
    enddo
  enddo
!
  do irankz=0,nprocsm
    do iranky=0,nprocsm
      if(irankz==myrank) then
        if(iranky==myrank) then
          do iz=nzi,nzf
            do iy=nzi,nzf
              do ix=0,nxm
                zf(ix,iy,iz)=zft(ix,iy,iz)
              enddo
            enddo
          enddo
        else
          do iz=nzi,nzf
            do iy=nzio(iranky),nzfo(iranky)
              call mpi_irecv(zf(0,iy,iz),nx,mpi_double_complex,iranky,1,mpi_comm_world,ireq,ierr)
              call mpi_wait(ireq,istatus,ierr)
            enddo
          enddo
        endif
      else if(iranky==myrank) then
        do iz=nzio(irankz),nzfo(irankz)
          do iy=nzi,nzf
            call mpi_isend(zft(0,iy,iz),nx,mpi_double_complex,irankz,1,mpi_comm_world,ireq,ierr) ! send to rank irankz
            call mpi_wait(ireq,istatus,ierr)
          enddo
        enddo
      endif
    enddo
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```
The results of a computation with and without MPI parallelism can differ slightly even though the computation is exactly the same (for chaotic simulations, the results become completely different after a long time evolution). One of the biggest reasons is inter-node reduction with mpi_reduce, because the order of the operations changes: summing double-precision numbers with mpi_sum gives a slightly different result from summing them within a node with +. If you want to avoid this, transfer the data with mpi_sendrecv or mpi_gather and then perform the necessary operations within one node. Neither result is more correct than the other, though.
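If reproducibility matters, one possibility (a sketch, with the variable names chosen for illustration) is to gather the per-rank partial sums on rank 0 and add them there in a fixed order, so the result does not depend on how mpi_sum happens to order the additions:

```
program fixed_order_sum
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank,iprocs
  real(8) :: s
  real(8),allocatable :: sall(:)
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
  s=dble(myrank)+0.1d0                ! stand-in for a partial sum computed on this rank
  allocate(sall(0:nprocs-1))
  call mpi_gather(s,1,mpi_double_precision,sall,1,mpi_double_precision,0,mpi_comm_world,ierr)
  if(myrank==0) then
    s=0.d0
    do iprocs=0,nprocs-1
      s=s+sall(iprocs)                ! always added in rank order 0,1,2,...
    enddo
    write(*,*) s
  endif
  call mpi_finalize(ierr)
  stop
end program fixed_order_sum
```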
Input and output of Fourier transform
```
statdft=dftisetvalue(plandft,dfti_backward_scale,1.d0/dble(N))
```

The scale factor can be included as a parameter of the descriptor in this way.
Here is an example of a module for a standard 2D FFT.
FFT using Intel Math Kernel Library
```
module intel_fft
  use mkl_dfti
!
  implicit none
  integer,parameter :: nx=512,ny=512
  integer,parameter :: nxy=nx*ny
  type(dfti_descriptor),pointer,save,private :: plandft
  integer,save,private :: statdft
!
!
contains
!
!
  subroutine intel_fft_ini
    statdft=dfticreatedescriptor(plandft,dfti_double,dfti_complex,2,(/nx,ny/))
    statdft=dftisetvalue(plandft,dfti_backward_scale,1.d0/dble(nxy))
    statdft=dfticommitdescriptor(plandft)
    return
  end subroutine intel_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nxy)
    statdft=dfticomputeforward(plandft,zfunc)
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nxy)
    statdft=dfticomputebackward(plandft,zfunc)
  end subroutine backwardfft
end module intel_fft
```

Run intel_fft_ini once at the beginning of the program. Unless you have a special reason (real sine/cosine transforms, quadruple precision), you should have no problem choosing this one, and OpenMP parallelization is done automatically. To compile, create a directory to store the module files (mod in this case) and type
```
ifort /opt/intel/compilers_and_libraries/linux/mkl/include/mkl_dfti.f90 -c (additional options) -module mod
ifort hoge.f90 -c -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -mkl -lpthread (additional options) -module mod
ifort hoge.o -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -mkl -lpthread (additional options) -module mod -o hoge.exe
```
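For reference, a minimal usage sketch of the module above (the array contents are illustrative):

```
program use_intel_fft
  use intel_fft
  implicit none
  complex(8) :: zfunc(nxy)
  zfunc=(1.d0,0.d0)
  call intel_fft_ini        ! create the DFTI descriptor once
  call forwardfft(zfunc)
  call backwardfft(zfunc)   ! the backward scale is set, so this restores the original data
end program use_intel_fft
```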
Notes on handling Intel MKL Real DFT CCE format
When performing a real multidimensional Fourier transform using Intel's MKL, there are four formats for storing the Fourier components: CCE, CCS, Pack, and Perm. In Pack and Perm the Fourier components fit into an array of the same size as the real data (you have to refer to the manual for the storage layout). Personally, I prefer Perm. CCS is out of the question. However, CCS, Pack, and Perm only support up to 2 dimensions and cannot be used in 3 dimensions. CCE stores the Fourier components in the same places as the complex Fourier transform, and its array size is \((n_x/2+1) \times n_y\) for an \(n_x \times n_y\) real input, so the array carries extra elements. Specifically, the Fourier components \(B(i,j)\) are stored in this enlarged array, so the input and output strides have to be set by hand; for a 3D transform the forward and backward descriptors look like this:

```
implicit none
type(dfti_descriptor),pointer,save,private :: plandftf,plandftb
integer,save,private :: statdftf,statdftb
!
statdftf=dfticreatedescriptor(plandftf,dfti_double,dfti_real,3,(/nx,ny,nz/))
statdftf=dftisetvalue(plandftf,dfti_conjugate_even_storage,dfti_complex_complex)
statdftf=dftisetvalue(plandftf,dfti_placement,dfti_not_inplace)
statdftf=dftisetvalue(plandftf,dfti_input_strides,(/0,1,nx,nx*ny/))
statdftf=dftisetvalue(plandftf,dfti_output_strides,(/0,1,nx/2+1,(nx/2+1)*ny/))
statdftf=dfticommitdescriptor(plandftf)
!
statdftb=dfticreatedescriptor(plandftb,dfti_double,dfti_real,3,(/nx,ny,nz/))
statdftb=dftisetvalue(plandftb,dfti_conjugate_even_storage,dfti_complex_complex)
statdftb=dftisetvalue(plandftb,dfti_placement,dfti_not_inplace)
statdftb=dftisetvalue(plandftb,dfti_backward_scale,1.d0/dble(nx*ny*nz))
statdftb=dftisetvalue(plandftb,dfti_input_strides,(/0,1,nx/2+1,(nx/2+1)*ny/))
statdftb=dftisetvalue(plandftb,dfti_output_strides,(/0,1,nx,nx*ny/))
statdftb=dfticommitdescriptor(plandftb)
```

Use plandftf for the forward Fourier transform and plandftb for the backward Fourier transform. What I'm trying to say is: put this much in the manual. This kind of problem does not occur in the Perm (or Pack) format. By the way, if you want to use the Perm format:
```
type(dfti_descriptor),pointer,save,private :: plandft
integer,save,private :: statdft
statdft=dfticreatedescriptor(plandft,dfti_double,dfti_real,2,(/nx,ny/))
statdft=dftisetvalue(plandft,dfti_backward_scale,1.d0/dble(nx*ny))
statdft=dftisetvalue(plandft,dfti_real_storage,dfti_real_complex)
statdft=dftisetvalue(plandft,dfti_packed_format,dfti_perm_format)
statdft=dfticommitdescriptor(plandft)
```

If you really want to use the Perm format for a 3D Fourier transform, you can use the FFT package distributed on Oura's site at Kyoto University (it also supports the DCT, so you can use it if you need a DCT; I mean, why doesn't Intel support the DCT while charging a ridiculous amount of money?).
FFT using FFTW (version 3)
FFTW itself can be installed by unpacking the archive, moving into the created directory, and running (on Ubuntu it can also be installed from the Ubuntu Software Center):

```
./configure
make
make install
```

If you plan to parallelize with OpenMP or MPI, the first line should be
```
./configure --enable-threads --enable-mpi
```

The module is somewhat cumbersome, and the array to be Fourier transformed must be specified in advance.
```
module fftw_fft
  use,intrinsic :: iso_c_binding
!
  implicit none
  include '/usr/local/include/fftw3.f03'
  integer,parameter :: nx=512,ny=512
  integer,parameter :: nxy=nx*ny
  type(c_ptr),save,private :: plandftf,plandftb
  real(8),private :: xlxy
  complex(8),private :: zf(nx,ny)
!
!
contains
!
!
  subroutine fftw_fft_ini
    plandftf=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_forward,fftw_measure)
    plandftb=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_backward,fftw_measure)
    xlxy=dble(nxy)
    return
  end subroutine fftw_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    call fftw_execute_dft(plandftf,zf,zf)
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    call fftw_execute_dft(plandftb,zf,zf)
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)/xlxy
  end subroutine backwardfft
end module fftw_fft
```

For parallelization with OpenMP, the number of threads must be specified in the program. To use the value of the environment variable OMP_NUM_THREADS as the number of threads:
```
module fftw_fft
  use,intrinsic :: iso_c_binding
  use omp_lib
!
  implicit none
  include '/usr/local/include/fftw3.f03'
  integer,parameter :: nx=512,ny=512
  integer,parameter :: nxy=nx*ny
  type(c_ptr),save,private :: plandftf,plandftb
  integer,private :: iret,nprocs
  real(8),private :: xlxy
  complex(8),private :: zf(nx,ny)
!
!
contains
!
!
  subroutine fftw_fft_ini
    !$OMP PARALLEL
    nprocs=omp_get_max_threads()
    !$OMP END PARALLEL
!
    call dfftw_init_threads(iret)
    call dfftw_plan_with_nthreads(nprocs)
!
    plandftf=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_forward,fftw_measure)
    plandftb=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_backward,fftw_measure)
    xlxy=dble(nxy)
    return
  end subroutine fftw_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    !$OMP WORKSHARE
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    !$OMP END WORKSHARE
    call fftw_execute_dft(plandftf,zf,zf)
    !$OMP WORKSHARE
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)
    !$OMP END WORKSHARE
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    !$OMP WORKSHARE
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    !$OMP END WORKSHARE
    call fftw_execute_dft(plandftb,zf,zf)
    !$OMP WORKSHARE
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)/xlxy
    !$OMP END WORKSHARE
  end subroutine backwardfft
end module fftw_fft
```

Also, if you only Fourier transform in the x direction and parallelize over the loop in y instead, you do not need to declare the number of threads to FFTW:
```
module fftw_fft
  use,intrinsic :: iso_c_binding
  use omp_lib
!
  implicit none
  include '/usr/local/include/fftw3.f03'
  integer,parameter :: nx=512,ny=512
  type(c_ptr),save,private :: plandftf,plandftb
  real(8),private :: xlx
  complex(8),private :: zf(nx)
!
!
contains
!
!
  subroutine fftw_fft_ini
    plandftf=fftw_plan_dft_1d(nx,zf,zf,fftw_forward,fftw_measure)
    plandftb=fftw_plan_dft_1d(nx,zf,zf,fftw_backward,fftw_measure)
    xlx=dble(nx)
    return
  end subroutine fftw_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    integer :: iy
    !$OMP PARALLEL private(zf)
    !$OMP DO
    do iy=1,ny
      zf(1:nx)=zfunc(1:nx,iy)
      call fftw_execute_dft(plandftf,zf,zf)
      zfunc(1:nx,iy)=zf(1:nx)
    enddo
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    integer :: iy
    !$OMP PARALLEL private(zf)
    !$OMP DO
    do iy=1,ny
      zf(1:nx)=zfunc(1:nx,iy)
      call fftw_execute_dft(plandftb,zf,zf)
      zfunc(1:nx,iy)=zf(1:nx)/xlx
    enddo
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine backwardfft
end module fftw_fft
```

Compile without OpenMP (the first example):
```
ifort hoge.f90 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -L /usr/local/lib -lfftw3 -lm -mkl -lpthread (additional options) -module mod -o hoge.exe
```

Compile with OpenMP (the latter examples):
```
ifort hoge.f90 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -L /usr/local/lib -lfftw3_threads -lfftw3 -lm -mkl -lpthread -qopenmp (additional options) -module mod -o hoge.exe
```
Let's compare the speed of some simple calculations (all were compared with the -O0 option, so the speed difference may be reduced with the -O3 option).
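A minimal sketch of the kind of timing loop presumably used for these comparisons (the loop count and the expression being timed are illustrative):

```
program timing
  implicit none
  integer(8) :: i,n,c0,c1,crate
  real(8) :: s
  complex(8) :: f
  n=100000000_8
  f=dcmplx(0.3d0,0.4d0)
  s=0.d0
  call system_clock(c0,crate)
  do i=1,n
    s=s+cdabs(f)**2        ! expression being timed
  enddo
  call system_clock(c1)
  write(*,*) 'result',s,'elapsed seconds',dble(c1-c0)/dble(crate)
end program timing
```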
absolute square of a complex number (about 10000000000 evaluations on a single core)
```
cdabs(f)**2
```

average 27.5 seconds
```
dreal(dconjg(f)*f)
```

average 9.8 seconds
```
dreal(f)**2+dimag(f)**2
```

average 10.5 seconds

I thought the third one would be the fastest, but surprisingly the second one is faster. The difference is small, though.
argument of complex numbers
```
datan2(dimag(f),dreal(f))
```

average 37.8 seconds
```
dimag(cdlog(f))
```

average 1 min 45.7 seconds
exponential of an imaginary number
```
dcmplx(dcos(f),dsin(f))
```

average 1 min 57.3 seconds
```
cdexp(dcmplx(0.d0,f))
```

average 1 min 58.2 seconds
Gaussian random number generation (using Intel MKL; generating 100000 random numbers, repeated 100000 times)
```
vdrnggaussian(vsl_rng_method_gaussian_icdf,stream,100000,f,0.d0,1.d0)
```

average 1 min 3.3 seconds
```
vdrnggaussian(vsl_rng_method_gaussian_boxmuller,stream,100000,f,0.d0,1.d0)
```

average 2 min 44.7 seconds
```
vdrnggaussian(vsl_rng_method_gaussian_boxmuller2,stream,100000,f,0.d0,1.d0)
```

average 1 min 47.1 seconds