Conversion between character and number
character → number

```
character(len=2) :: ca
integer :: a
ca='10'
read(ca,*) a
```

number → character
```
character(len=2) :: ca
integer :: a
a=10
write(ca,*) a
```
Write a string without specifying the number of characters (no trailing spaces)
```
character(len=100) :: ca
ca='abcdefg'
write(*,'(a)') trim(ca)
```
Write out an integer with the minimum number of digits (no leading spaces)
```
integer :: i
i=10000
write(*,'(i0)') i
```
zero fill when writing out an integer
Display in a field of width 7, zero-filled to 6 digits:

```
integer :: i
i=12345
write(*,'(i7.6)') i
```
Automatically change character count
With an allocatable character, the length is updated automatically whenever the string is reassigned:

```
character(:),allocatable :: ca
ca='abcdefg'
write(*,*) len(ca)
ca='abcdefghijklmn'
write(*,*) len(ca)
```
Append to a file that has already been closed
Specify position='append' in the open statement. However, this is only valid for formatted files.
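For example, a minimal sketch (the file name is illustrative):

```
open(10,file='hoge.dat',position='append')
write(10,*) 'additional line'
close(10)
```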
Maximum and minimum values for numbers

type | minimum | maximum | significant digits |
---|---|---|---|
integer(1) | -128 | 127 | |
integer(2) | -32768 | 32767 | |
integer(4) | -2147483648 | 2147483647 | |
integer(8) | -9223372036854775808 | 9223372036854775807 | |
real(4) | 1.17549435e-38 | 3.40282347e38 | 8 |
real(8) | 2.2250738585072014d-308 | 1.7976931348623157d308 | 16 |
Debugging options
No optimization at all:

```
ifort -O0 hoge.f90
```

Check for segmentation faults due to out-of-bounds array access or other memory misuse:
```
ifort -CB -traceback hoge.f90
```

Check for zero division or value overflow:
```
ifort -fpe0 -traceback hoge.f90
```

When executed with the above options, a detailed error message is displayed and the calculation stops.
Optimization options
Automatic vectorization:

```
ifort -xHost hoge.f90
```

Compile while inlining all subroutine calls (interprocedural optimization):
```
ifort -ipo hoge.f90
```

If you have a very large number of subroutines, you can hope for some speedup. (Note: the -ipo option is not recommended. The order in which subroutines are executed may change internally and you may not get the desired result. It is true that -ipo sometimes makes things faster, but it is better to give up than to spend extra time suffering from -ipo bugs.)
```
ifort -O1 hoge.f90
ifort -O2 hoge.f90
ifort -O3 hoge.f90
```

The optimization level increases in the order O1 → O2 → O3, but O3 is not always faster than O2 (it depends on how well the program is written). Also, depending on how the program is written, raising the optimization level can introduce bugs. The default is O2.
Faster division
```
ifort -no-prec-div hoge.f90
```

Division takes far longer than addition, subtraction, and multiplication, so this option speeds up division by reducing its precision. (It is better to reduce the number of divisions in the first place: for example, instead of dividing by 2, multiply by 0.5, or use a bit shift for integers.)
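A minimal sketch of the replacements just mentioned (the values are illustrative):

```
real(8) :: x,y
integer :: i,j
x=3.d0
y=x*0.5d0      ! instead of y=x/2.d0
i=10
j=ishft(i,-1)  ! divide a non-negative integer by 2 with a right shift, instead of j=i/2
```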
In Fortran, when an array of two or more dimensions is allocated, memory is laid out with the first (innermost) index varying fastest. For example, for
```
real(8) :: a(3,3)
```

the actual order of the allocated memory is
```
a(1,1) a(2,1) a(3,1) a(1,2) a(2,2) a(3,2) a(1,3) a(2,3) a(3,3)
```

(the same as Julia, but the opposite of C/C++, Java, Python, etc.). If you manipulate an array with do loops, it is faster to traverse it in memory order. For example,
```
do i=1,3
  do j=1,3
    a(i,j)=i+j
  enddo
enddo
```

and
```
do j=1,3
  do i=1,3
    a(i,j)=i+j
  enddo
enddo
```

perform the same operations, but the latter is faster because it accesses the array in memory order.
Step function \( \theta(x) \)
```
0.5d0*(dsign(1.d0,x)+1.d0)
```

However, there are many cases where an if statement is faster. The only advantage may be that it can be written neatly on a single line.
Built-in function atan2(y,x)=atan(y/x)
Note that the argument order is atan2(y,x), not atan2(x,y) (the same as gnuplot, the opposite of Mathematica).
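A quick check of the argument order (the values are chosen only for illustration):

```
write(*,*) atan2(1.d0,0.d0)   ! pi/2: the point (x,y)=(0,1) lies on the +y axis
write(*,*) atan2(0.d0,1.d0)   ! 0:    the point (x,y)=(1,0) lies on the +x axis
```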
Writing and reading data is often the bottleneck of a numerical calculation. For example, instead of writing array data out one element at a time like

```
open(10,file='hoge.dat')
do j=1,3
  do i=1,3
    write(10,*) a(i,j)
  enddo
enddo
close(10)
```
it is faster to write the data out together in a single statement:

```
open(10,file='hoge.dat')
write(10,*) ((a(i,j),i=1,3),j=1,3)
close(10)
```
There are two types of data files: human-readable (ASCII) and machine-readable (binary). If the data does not need to be read by humans (e.g., it is just passed to another program), writing it out in binary format gives a surprisingly small writing time and file size (binary is also much faster to read). Incidentally, AVS/Express and Python's NumPy can read binary data files. It seems that, with some training, gnuplot can also read binary data files. Paraview, however, requires the awkward trick of mixing in ASCII when reading binary data. Example of writing data in binary format:
```
open(10,file='hoge',form='binary')
write(10) ((a(i,j),i=1,3),j=1,3)
close(10)
```

It is better not to add an extension, or if you do, use an extension other than dat (e.g. bin).
About unformatted, binary
unformatted : When used with access='sequential' (the default), a 4-byte header and footer are added every time data is written with a write statement. It is somewhat similar to the formatted format, where a newline code is inserted by every write statement. It is useless.
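As a sketch of what this means, the following writes a single real(8) value and then checks the file size, which comes out larger than 8 bytes because of the record markers (the file name is illustrative):

```
program recmarker
  implicit none
  integer :: fsize
  real(8) :: x
  x=1.d0
  open(10,file='seq.bin',form='unformatted')  ! access='sequential' is the default
  write(10) x                                 ! header + 8 bytes of data + footer
  close(10)
  inquire(file='seq.bin',size=fsize)
  write(*,*) 'file size in bytes:',fsize      ! typically 16 = 4 + 8 + 4
end program recmarker
```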
By the way, if you want to avoid line feed codes in the formatted format
```
write(10,*,advance='no')
```

binary : Easy to use because neither header nor footer is added by the write statement. However, it is not implemented except in Intel compilers.
If you must use unformatted with a compiler other than Intel's, use access='direct' with recl=(record length in bytes), or access='stream'. In the former case, each record consists of the number of bytes specified by recl, and
```
write(10,rec=5)
```

writes to the position of the 5th record (each record being recl bytes) from the top of the file. If you do not want the concept of records,
```
write(10,pos=5)
```

with stream access writes starting at the 5th byte from the top of the file. Incidentally, in the formatted case the position is counted in characters instead of bytes (including line breaks).
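A minimal sketch of the two alternatives (file names and values are illustrative; note that the unit of recl for unformatted direct access may be compiler dependent, e.g. 4-byte words with ifort unless -assume byterecl is given):

```
real(8) :: x
x=1.d0
! direct access: fixed-length records addressed with rec=
open(10,file='direct.bin',form='unformatted',access='direct',recl=8)
write(10,rec=5) x            ! writes the 5th record
close(10)
! stream access: no records, addressed by byte position with pos=
open(11,file='stream.bin',form='unformatted',access='stream')
write(11,pos=9) x            ! writes starting at the 9th byte
close(11)
```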
Since computers can only handle numbers with a finite number of digits, significant digits can be lost in addition and subtraction. For example, (a+b)+c and a+(b+c) are not necessarily equal. The loss of digits becomes serious when a and b are close to each other and a-b is computed. Two examples of problems are shown below. The first is the variance: computing it as the mean of the squares minus the square of the mean subtracts two nearly equal numbers when the mean is large compared to the spread.
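A minimal sketch of that cancellation (the numbers are illustrative): the naive one-pass formula loses almost all digits, while subtracting the mean first does not.

```
program variance_cancel
  implicit none
  integer,parameter :: n=1000
  integer :: i
  real(8) :: x(n),mean,var_naive,var_twopass
  ! numbers with a large mean (1e8) and a small spread (order 1)
  do i=1,n
    x(i)=1.d8+dble(i)/dble(n)
  enddo
  mean=sum(x)/dble(n)
  var_naive=sum(x**2)/dble(n)-mean**2      ! subtracts two nearly equal huge numbers
  var_twopass=sum((x-mean)**2)/dble(n)     ! subtracts first, then squares
  write(*,*) var_naive,var_twopass
end program variance_cancel
```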
Do not use strides in numerical calculations or when reading/writing files. For example, when using a Fourier transform library to perform multiple Fourier transforms, rather than using a stride like
```
program main
  use mkl_dfti
  implicit none
  type(dfti_descriptor),pointer :: plandft
  integer,parameter :: n=1024
  integer :: statdft
  complex(8) :: zf(n*5)
  statdft=dfticreatedescriptor(plandft,dfti_double,dfti_complex,1,n)
  statdft=dftisetvalue(plandft,dfti_number_of_transforms,5)
  statdft=dftisetvalue(plandft,dfti_input_distance,1)
  statdft=dftisetvalue(plandft,dfti_output_distance,1)
  statdft=dftisetvalue(plandft,dfti_input_strides,(/0,5/))
  statdft=dftisetvalue(plandft,dfti_output_strides,(/0,5/))
  statdft=dfticommitdescriptor(plandft)
  statdft=dfticomputeforward(plandft,zf)
  stop
end program main
```

it is better to read the data as a single contiguous chunk for each transform, as in
```
program main
  use mkl_dfti
  implicit none
  type(dfti_descriptor),pointer :: plandft
  integer,parameter :: n=1024
  integer :: statdft
  complex(8) :: zf(n*5)
  statdft=dfticreatedescriptor(plandft,dfti_double,dfti_complex,1,n)
  statdft=dftisetvalue(plandft,dfti_number_of_transforms,5)
  statdft=dftisetvalue(plandft,dfti_input_distance,n)
  statdft=dftisetvalue(plandft,dfti_output_distance,n)
  statdft=dfticommitdescriptor(plandft)
  statdft=dfticomputeforward(plandft,zf)
  stop
end program main
```

(although a stride is necessary if you want to Fourier transform only the 2nd or 3rd dimension). Similarly, when reading data in AVS/Express, etc., do not use strides; read the data as a single chunk.
Under OpenPBS, the number of cores specified in ppn can be given to OpenMP automatically with
```
export OMP_NUM_THREADS=$PBS_NUM_PPN
```
Variables used in a job script can be supplied with the -v option of qsub. For example, if the script contains
```
#PBS -o out-$test1-$test2.log
./$out.exe $test1 $test2
```

and so on, then
```
qsub ./hoge.sh -v out=test,test1=1,test2=2
```

causes
```
./test.exe 1 2
```

to be executed, and a log file named out-1-2.log is created.
Parallelization of do statements
A completely independent do loop can be parallelized easily:

```
real(8) :: a(1:100),b(1:100),c(1:100)
! (omitted)
!$OMP PARALLEL DO
do i=1,100
  a(i)=b(i)*c(i)
enddo
!$OMP END PARALLEL DO
```

The do loop between !$OMP PARALLEL DO and !$OMP END PARALLEL DO is parallelized. For example, with four cores, 1≤i≤25 goes to the first core, 26≤i≤50 to the second, 51≤i≤75 to the third, and 76≤i≤100 to the fourth; finally, at !$OMP END PARALLEL DO, the parts of a computed by each core are merged.
```
real(8) :: a(1:100,1:100),b(1:100,1:100),c(1:100,1:100)
! (omitted)
!$OMP PARALLEL DO
do j=1,100
  do i=1,100
    a(i,j)=b(i,j)*c(i,j)
  enddo
enddo
!$OMP END PARALLEL DO
```

If the do statements are nested, the outer do statement (j in this case) is parallelized.
On the other hand, in a loop with a dependency like

```
real(8) :: a(1:100),b(1:99)
! (omitted)
!$OMP PARALLEL DO
do i=1,99
  a(i+1)=a(i)+b(i)
enddo
!$OMP END PARALLEL DO
```

the result a(i) computed at the previous i is needed to compute a(i+1). With 4-core parallelism, for example, a(26) needs the result of a(25), but a(26) is computed by the second core while a(25) is still being computed by the first core, so a(26) does not come out correctly (and if a(26) is wrong, everything after it is wrong too).
Similarly, with a sum like

```
real(8) :: s,a(1:100)
! (omitted)
s=0.d0
!$OMP PARALLEL DO
do i=1,100
  s=s+a(i)
enddo
!$OMP END PARALLEL DO
```

each core adds up a(i) over its own range of i independently, but the updates to the single variable s from the different cores conflict and cannot be combined correctly, so a strange result is obtained. In this case, use an array expression as in the example below, or use ATOMIC:
```
real(8) :: s,a(1:100)
! (omitted)
s=0.d0
!$OMP PARALLEL DO
do i=1,100
  !$OMP ATOMIC
  s=s+a(i)
enddo
!$OMP END PARALLEL DO
```

!$OMP ATOMIC applies to the next line and makes the modification of the shared memory atomic. The use of array expressions is recommended for speed and stability; ATOMIC should be used only when array expressions cannot be used.
Parallelization of array expressions
Array expressions can also be parallelized easily. For example,

```
real(8) :: a(1:100),b(1:100),c(1:100)
! (omitted)
!$OMP WORKSHARE
a(1:100)=b(1:100)+c(1:100)
!$OMP END WORKSHARE
```

Array expressions between !$OMP WORKSHARE and !$OMP END WORKSHARE are parallelized. Array functions can also be parallelized, e.g.
```
real(8) :: s,a(1:100)
! (omitted)
!$OMP WORKSHARE
s=sum(a(1:100))
!$OMP END WORKSHARE
```

Each core computes the sum over a different range of a, and the total is obtained by adding the partial sums together at !$OMP END WORKSHARE.
To compile a program using OpenMP, use the compile option -openmp.
```
ifort -openmp hoge.f90
```

Also, the number of cores must be given via the environment variable OMP_NUM_THREADS for the computation to actually run in parallel. An example job script is
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
./out.exe
```

Specify the number of parallel cores both in ppn (the #PBS -l nodes line) and in the environment variable OMP_NUM_THREADS (the export line). (Caution!) When used together with ulimit, the environment variable must be set before ulimit to work properly. In other words,
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
ulimit -s unlimited
export OMP_NUM_THREADS=4
./out.exe
```

is not good enough;
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
ulimit -s unlimited
./out.exe
```

is what it must be. As mentioned above, the ppn value can also be picked up automatically:
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=$PBS_NUM_PPN
./out.exe
```

If the program terminates with a segmentation violation only when run with OpenMP parallelism, the stack area may be too small. Use the environment variable OMP_STACKSIZE,
```
export OMP_STACKSIZE=(stack size in KB)
```

to increase the stack area. Try starting with about 512 MB (512000), depending on the program size.
Parallelization of the random number generation library mkl_vsl using OpenMP
Parallelizing the do loop that calls the generator one random number at a time is very slow. Here we instead create one random number stream (seed) per thread and parallelize by hand.

```
program main
  use mkl_vsl_type
  use mkl_vsl
  use omp_lib
!
  implicit none
  integer,parameter :: n=1000
  integer :: iprocs,nprocs,nprocsm
  integer,allocatable :: nru(:),nrd(:),nrp(:)
  integer :: statvsl,irand,cr,cm
  type(vsl_stream_state),allocatable :: planvsl(:)
  real(8) :: fake(1),frand(0:n-1)
!
  !$OMP PARALLEL
  nprocs=omp_get_max_threads()
  !$OMP END PARALLEL
!
  if(mod(n,nprocs)==0) then
    nprocsm=nprocs-1
    allocate(nru(0:nprocsm),nrd(0:nprocsm),nrp(0:nprocsm))
    do iprocs=0,nprocsm
      nrp(iprocs)=n/nprocs
      nrd(iprocs)=nrp(iprocs)*iprocs
      nru(iprocs)=nrp(iprocs)*(iprocs+1)-1
    enddo
  else
    nprocsm=n/(n/nprocs+1)
    allocate(nru(0:nprocsm),nrd(0:nprocsm),nrp(0:nprocsm))
    do iprocs=0,nprocsm
      nrp(iprocs)=n/nprocs+1
      nrd(iprocs)=nrp(iprocs)*iprocs
      nru(iprocs)=nrp(iprocs)*(iprocs+1)-1
    enddo
    nrp(nprocsm)=n-nrd(nprocsm)
    nru(nprocsm)=n-1
  endif
!
  allocate(planvsl(0:nprocsm))
  do iprocs=0,nprocsm
    call system_clock(irand,cr,cm)
    statvsl=vslnewstream(planvsl(iprocs),vsl_brng_mt19937,irand+iprocs)
    statvsl=vdrnggaussian(vsl_method_dgaussian_icdf,planvsl(iprocs),1,fake,0.d0,1.d0)
  enddo
!
  !$OMP PARALLEL
  iprocs=omp_get_thread_num()
  if(iprocs<=nprocsm) then
    statvsl=vdrnguniform(vsl_method_duniform_std,planvsl(iprocs),nrp(iprocs),frand(nrd(iprocs)),0.d0,1.d0)
  endif
  !$OMP END PARALLEL
  stop
end program main
```

The iprocs-th thread generates nrp(iprocs) random numbers, filling the region from frand(nrd(iprocs)) to frand(nru(iprocs)). The disadvantage of this program is that OMP_NUM_THREADS=1 must be set explicitly even when parallelization is not used.
In MPI parallel computation, all cores execute the same program. The simplest example of an MPI program is shown below.
```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  write(*,*) myrank,nprocs
!
  call mpi_finalize(ierr)
  stop
end program main
```

On line 3, the file defining the set of functions needed for MPI parallelism is read; MPI parallel code cannot be executed without reading this file. If you want to run MPI in parallel on another system, you may need to change the path in this include line, because the location of mpif.h differs from system to system.
ierr : an integer that stores the error status.
nprocs : an integer that stores the total number of MPI processes.
myrank : an integer that stores this process's own rank among all MPI processes.
Line 6 declares the start of MPI parallelism. Everything from here until mpi_finalize on line 12 is executed in parallel.
On line 7, the MPI function mpi_comm_size stores the total number of processes in nprocs. mpi_comm_world is a variable called an MPI communicator, which can be thought of as a kind of boilerplate incantation.
On line 8, the MPI function mpi_comm_rank stores this process's own rank in myrank. In other words, myrank takes a different value on each core and is the most important value for parallel computation. 0 ≤ myrank ≤ nprocs-1.
The end of MPI parallelism is declared by mpi_finalize on line 12. Execution proceeds to the next line only after all cores have finished their calculations. Without this line, the first core to finish would execute the stop on line 13 and the program would terminate even though the other cores are still in the middle of their calculations, so this line is very important.
Running this with 4 processes prints

```
0 4
1 4
2 4
3 4
```

(the order of the lines may vary). To compile a program written for MPI parallel computation, use mpif90 instead of ifort (or mpif77 if you are writing in FORTRAN 77). An example of a job script for MPI parallel computation is
```
#!/bin/bash
#PBS -q batch
#PBS -e error.log
#PBS -o out.log
#PBS -l nodes=4:ppn=1
#PBS -l walltime=00:10:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
mpirun ./out.exe
```

Use mpirun for execution. Note the difference between nodes and ppn: ppn specifies the degree of parallelism within a node, while nodes specifies the degree of parallelism across nodes. For example, with nodes=1:ppn=4, maui finds the fastest node at that moment and runs an intra-node parallel computation with 4 processes on it. Conversely, with nodes=4:ppn=1, maui assigns the job to nodes one process at a time, measuring the speed of each node as needed; once the allocation of cores is decided, if all cores end up on the same node it becomes intra-node parallelism (i.e., almost the same as nodes=1:ppn=4), and if they are spread over several nodes it becomes inter-node parallelism. The total number of processes = nodes × ppn. For example, nodes=2:ppn=4 gives an 8-process computation, as if two nodes each ran 4 intra-node processes (with nodes=2, if maui assigns both sets to the same node, it becomes 8-process intra-node parallelism). Since inter-node parallelism exchanges data over LAN cables, the more data communication there is, the worse the parallel efficiency becomes.
Parallelization of completely independent do statements
This is what is called "idiot parallelism" (embarrassingly parallel computation), i.e., parallelization that could just as well be replaced by submitting multiple independent jobs. For example, if you want to perform exactly the same calculation with different parameters:

```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
!
  integer :: it
  real(8) :: t
  ! (define other variables)
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  do it=myrank,100,nprocs
    t=dble(it)/100.d0
    ! (calculation)
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```

This is the simplest way of parallelizing over the parameter when the calculation is performed for 0≤t≤1 in steps of 0.01. The caveat with this parallelization is that the 101 values of t are divided equally among the cores, so the fastest core finishes first and then waits forever while the slowest core keeps computing forever, especially with inter-node parallelism. To avoid this, the following improvement can be considered.
```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
  character(len=20) :: crank
!
  integer :: it
  real(8) :: t
  ! (define other variables)
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  it=myrank
  if(myrank==0) then
    open(10,file='next',form='binary')
    write(10) nprocs
    close(10)
  endif
  write(crank,*) myrank
  call system('sleep '//crank)
!
  do
    t=dble(it)/100.d0
    ! (calculation)
    open(10,file='next',form='binary',status='old')
    read(10) it
    close(10)
    if(it>100) exit
    open(10,file='next',form='binary')
    write(10) it+1
    close(10)
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```

Here, whichever core finishes the calculation for its current t takes the next t in turn. A core that reaches the end of the do loop reads the file "next" to learn the next value of it and overwrites "next" with it+1. To keep several cores from reading and writing "next" at the same moment, each core sleeps for myrank seconds just before the do loop, so the timing of the accesses is staggered slightly from core to core.
When you want to accumulate an ensemble in a Monte Carlo calculation, etc., you can parallelize over the ensemble members. In this case, the values calculated by each core must be added up at the end in order to compute averages and so on. Rewriting the previous example one step further:
```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
  character(len=20) :: crank
!
  integer :: it
  real(8) :: t,s,st
  ! (define other variables)
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  it=myrank
  if(myrank==0) then
    open(10,file='next.bin',form='binary')
    write(10) nprocs
    close(10)
  endif
  write(crank,*) myrank
  call system('sleep '//crank)
!
  s=0.d0
  do
    ! (calculation)
    s=s+st
    !
    open(10,file='next.bin',form='binary',status='old')
    read(10) it
    close(10)
    if(it>100) exit
    open(10,file='next.bin',form='binary')
    write(10) it+1
    close(10)
  enddo
  call mpi_allreduce(mpi_in_place,s,1,mpi_double_precision,mpi_sum,mpi_comm_world,ierr)
!
  if(myrank==0) then
    open(11,file='average.dat',form='formatted')
    write(11,'(e26.16e3)') s/101.d0
    close(11)
  endif
!
  call mpi_finalize(ierr)
  stop
end program main
```

This computes the 101 ensemble members in parallel and accumulates the per-member value st into s on each core. Finally, the s of every core has to be added together, and this is where data communication is performed. The command for this is mpi_allreduce, which is used in the form
```
call mpi_allreduce(variable to send, variable to store the result, number of data items, data type, reduction operation, MPI communicator, error status)
```

If you want the variable you send and the variable that stores the result to be the same, pass mpi_in_place as the variable to send. In this way, the sum of s over all nodes is computed and stored back into s on every node. The data to be communicated may also be an array; in that case, pass the first element of the array and the number of elements to communicate. Data types include mpi_integer, mpi_double_precision, and mpi_double_complex. In the example above, the data of all cores are added together during the communication, which is what mpi_sum specifies. Various other operations can be specified, such as taking the product with mpi_prod. If an array is given, the sum or product is taken element by element. See the manual for details.
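For reference, reducing an array works the same way; a minimal sketch with placeholder names arr and n (arr being a real(8) array of length n that exists on every rank):

```
call mpi_allreduce(mpi_in_place,arr,n,mpi_double_precision,mpi_sum,mpi_comm_world,ierr)
! element-wise sum over all ranks; the result is stored back into arr on every rank
```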
Basically, a communication command is executed once all cores have reached it (i.e., the faster cores wait there), but if you want to synchronize the cores deliberately (keep them in step), you can use

```
call mpi_barrier(mpi_comm_world,ierr)
```
The following is an example of halo exchange with mpi_sendrecv: a 2D array is distributed over the y direction, each rank exchanges its boundary rows with its neighbors, and then a Laplacian-like finite difference is computed.

```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank
  integer :: istatus(mpi_status_size)
  integer :: kup,kdown
!
  integer,parameter :: nx=100,ny=100
  integer,parameter :: nxm=nx-1,nym=ny-1
  integer :: ix,iy
  integer :: nyi,nyf,nyr
  real(8),allocatable :: f(:,:),fd(:,:)
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
!
  kup=myrank+1
  kdown=myrank-1
  if(myrank==0) kdown=mpi_proc_null
  if(myrank==nprocs-1) kup=mpi_proc_null
!
  if(mod(ny,nprocs)/=0) stop
  nyr=ny/nprocs
  nyi=myrank*nyr
  nyf=(myrank+1)*nyr-1
!
  allocate(f(-1:nx,nyi-1:nyf+1),fd(0:nxm,nyi:nyf))
!
  do iy=nyi,nyf
    do ix=0,nxm
      f(ix,iy)=... ! assign value to f
    enddo
  enddo
!
  do iy=nyi,nyf
    f(-1,iy)=0.d0
    f(nx,iy)=0.d0
  enddo
  if(myrank==0) then
    do ix=0,nxm
      f(ix,-1)=0.d0
    enddo
  else if(myrank==nprocs-1) then
    do ix=0,nxm
      f(ix,ny)=0.d0
    enddo
  endif
!
  call mpi_sendrecv(f(0,nyf),nx,mpi_double_precision,kup,1,f(0,nyi-1),nx,mpi_double_precision,kdown,1,mpi_comm_world,istatus,ierr)
  call mpi_sendrecv(f(0,nyi),nx,mpi_double_precision,kdown,1,f(0,nyf+1),nx,mpi_double_precision,kup,1,mpi_comm_world,istatus,ierr)
!
  do iy=nyi,nyf
    do ix=0,nxm
      fd(ix,iy)=f(ix+1,iy)+f(ix-1,iy)+f(ix,iy+1)+f(ix,iy-1)-4.d0*f(ix,iy)
    enddo
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```
The next example is a 3D FFT distributed over the z direction: each rank performs 2D FFTs on its own xy planes, the data are transposed across ranks with mpi_isend/mpi_irecv, 1D FFTs are applied along the remaining direction, and the data are transposed back.

```
program main
  implicit none
  include 'mpif.h'
  integer :: ierr,ireq,nprocs,nprocsm,myrank
  integer :: istatus(mpi_status_size)
!
  integer,parameter :: nx=512
  integer,parameter :: nxm=nx-1
  integer :: ix,iy,iz,iranky,irankz
  integer :: nzi,nzf,nzr
  integer,allocatable :: nzio(:),nzfo(:)
  complex(8),allocatable :: zf(:,:,:),zft(:,:,:),zf1(:)
!
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
  nprocsm=nprocs-1
!
  if(mod(nx,nprocs)/=0) stop
  nzr=nx/nprocs
  nzi=myrank*nzr
  nzf=(myrank+1)*nzr-1
!
  allocate(nzio(0:nprocsm),nzfo(0:nprocsm))
  do irankz=0,nprocsm
    nzio(irankz)=irankz*nzr
    nzfo(irankz)=(irankz+1)*nzr-1
  enddo
!
  allocate(zf(0:nxm,0:nxm,nzi:nzf),zft(0:nxm,nzi:nzf,0:nxm),zf1(0:nxm))
!
  do iz=nzi,nzf
    do iy=0,nxm
      do ix=0,nxm
        zf(ix,iy,iz)=... ! assign value to zf
      enddo
    enddo
  enddo
!
  do iz=nzi,nzf
    call forwardfft2d(zf(0,0,iz)) ! 2D Fourier transform on a single node
  enddo
!
  do irankz=0,nprocsm
    do iranky=0,nprocsm
      if(irankz==myrank) then
        if(iranky==myrank) then
          do iz=nzi,nzf
            do iy=nzi,nzf
              do ix=0,nxm
                zft(ix,iy,iz)=zf(ix,iy,iz)
              enddo
            enddo
          enddo
        else
          do iz=nzi,nzf
            do iy=nzio(iranky),nzfo(iranky)
              call mpi_isend(zf(0,iy,iz),nx,mpi_double_complex,iranky,1,mpi_comm_world,ireq,ierr)
              call mpi_wait(ireq,istatus,ierr)
            enddo
          enddo
        endif
      else if(iranky==myrank) then
        do iz=nzio(irankz),nzfo(irankz)
          do iy=nzi,nzf
            call mpi_irecv(zft(0,iy,iz),nx,mpi_double_complex,irankz,1,mpi_comm_world,ireq,ierr) ! receive from rank irankz
            call mpi_wait(ireq,istatus,ierr)
          enddo
        enddo
      endif
    enddo
  enddo
!
  do iy=nzi,nzf
    do ix=0,nxm
      do iz=0,nxm
        zf1(iz)=zft(ix,iy,iz)
      enddo
      call forwardfft1d(zf1) ! 1D Fourier transform on a single node
      do iz=0,nxm
        zft(ix,iy,iz)=zf1(iz)
      enddo
    enddo
  enddo
!
  do irankz=0,nprocsm
    do iranky=0,nprocsm
      if(irankz==myrank) then
        if(iranky==myrank) then
          do iz=nzi,nzf
            do iy=nzi,nzf
              do ix=0,nxm
                zf(ix,iy,iz)=zft(ix,iy,iz)
              enddo
            enddo
          enddo
        else
          do iz=nzi,nzf
            do iy=nzio(iranky),nzfo(iranky)
              call mpi_irecv(zf(0,iy,iz),nx,mpi_double_complex,iranky,1,mpi_comm_world,ireq,ierr)
              call mpi_wait(ireq,istatus,ierr)
            enddo
          enddo
        endif
      else if(iranky==myrank) then
        do iz=nzio(irankz),nzfo(irankz)
          do iy=nzi,nzf
            call mpi_isend(zft(0,iy,iz),nx,mpi_double_complex,irankz,1,mpi_comm_world,ireq,ierr) ! send to rank irankz
            call mpi_wait(ireq,istatus,ierr)
          enddo
        enddo
      endif
    enddo
  enddo
!
  call mpi_finalize(ierr)
  stop
end program main
```
The results of a computation with and without MPI parallelism can differ slightly even though the computation is exactly the same (for chaotic simulations, the results become completely different after a long time evolution). One of the biggest reasons is inter-node reduction with mpi_reduce, because the order of the operations changes: summing double-precision numbers with mpi_sum gives a slightly different result from summing them within a node with +. If you want to avoid this, transfer the data with mpi_sendrecv or mpi_gather and then perform the necessary operations within one node. Neither result is more correct than the other, though.
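If reproducibility matters, one possibility (a sketch, with the variable names chosen for illustration) is to gather the per-rank partial sums on rank 0 and add them there in a fixed order, so the result does not depend on how mpi_sum happens to order the additions:

```
program fixed_order_sum
  implicit none
  include 'mpif.h'
  integer :: ierr,nprocs,myrank,iprocs
  real(8) :: s
  real(8),allocatable :: sall(:)
  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world,nprocs,ierr)
  call mpi_comm_rank(mpi_comm_world,myrank,ierr)
  s=dble(myrank)+0.1d0                ! stand-in for a partial sum computed on this rank
  allocate(sall(0:nprocs-1))
  call mpi_gather(s,1,mpi_double_precision,sall,1,mpi_double_precision,0,mpi_comm_world,ierr)
  if(myrank==0) then
    s=0.d0
    do iprocs=0,nprocs-1
      s=s+sall(iprocs)                ! always added in rank order 0,1,2,...
    enddo
    write(*,*) s
  endif
  call mpi_finalize(ierr)
  stop
end program fixed_order_sum
```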
Input and output of Fourier transform
```
statdft=dftisetvalue(plandft,dfti_backward_scale,1.d0/dble(N))
```

The scale factor can be included as a parameter of the descriptor in this way.
Here is an example of a module for a standard 2D FFT.
FFT using Intel Math Kernel Library
```
module intel_fft
  use mkl_dfti
!
  implicit none
  integer,parameter :: nx=512,ny=512
  integer,parameter :: nxy=nx*ny
  type(dfti_descriptor),pointer,save,private :: plandft
  integer,save,private :: statdft
!
!
contains
!
!
  subroutine intel_fft_ini
    statdft=dfticreatedescriptor(plandft,dfti_double,dfti_complex,2,(/nx,ny/))
    statdft=dftisetvalue(plandft,dfti_backward_scale,1.d0/dble(nxy))
    statdft=dfticommitdescriptor(plandft)
    return
  end subroutine intel_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nxy)
    statdft=dfticomputeforward(plandft,zfunc)
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nxy)
    statdft=dfticomputebackward(plandft,zfunc)
  end subroutine backwardfft
end module intel_fft
```

Run intel_fft_ini once at the beginning of the program. Unless you have a special reason (real sine/cosine transforms, quadruple precision), you should have no problem choosing this one, and OpenMP parallelization is done automatically. To compile, create a directory to store the module files (mod in this case) and type
```
ifort /opt/intel/compilers_and_libraries/linux/mkl/include/mkl_dfti.f90 -c (additional options) -module mod
ifort hoge.f90 -c -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -mkl -lpthread (additional options) -module mod
ifort hoge.o -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -mkl -lpthread (additional options) -module mod -o hoge.exe
```
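For reference, a minimal usage sketch of the module above (the array contents are illustrative):

```
program use_intel_fft
  use intel_fft
  implicit none
  complex(8) :: zfunc(nxy)
  zfunc=(1.d0,0.d0)
  call intel_fft_ini        ! create the DFTI descriptor once
  call forwardfft(zfunc)
  call backwardfft(zfunc)   ! the backward scale is set, so this restores the original data
end program use_intel_fft
```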
Notes on handling Intel MKL Real DFT CCE format
When performing a real multidimensional Fourier transform using Intel's MKL, there are four formats for storing the Fourier components: CCE, CCS, Pack, and Perm. In Pack and Perm the Fourier components fit into an array of the same size as the real data (you have to refer to the manual for the storage layout). Personally, I prefer Perm. CCS is out of the question. However, CCS, Pack, and Perm only support up to 2 dimensions and cannot be used in 3 dimensions. CCE stores the Fourier components in the same places as the complex Fourier transform, and its array size is \((n_x/2+1) \times n_y\) for an \(n_x \times n_y\) real input, so the array carries extra elements. Specifically, the Fourier components \(B(i,j)\) are stored in this enlarged array, so the input and output strides have to be set by hand; for a 3D transform the forward and backward descriptors look like this:

```
implicit none
type(dfti_descriptor),pointer,save,private :: plandftf,plandftb
integer,save,private :: statdftf,statdftb
!
statdftf=dfticreatedescriptor(plandftf,dfti_double,dfti_real,3,(/nx,ny,nz/))
statdftf=dftisetvalue(plandftf,dfti_conjugate_even_storage,dfti_complex_complex)
statdftf=dftisetvalue(plandftf,dfti_placement,dfti_not_inplace)
statdftf=dftisetvalue(plandftf,dfti_input_strides,(/0,1,nx,nx*ny/))
statdftf=dftisetvalue(plandftf,dfti_output_strides,(/0,1,nx/2+1,(nx/2+1)*ny/))
statdftf=dfticommitdescriptor(plandftf)
!
statdftb=dfticreatedescriptor(plandftb,dfti_double,dfti_real,3,(/nx,ny,nz/))
statdftb=dftisetvalue(plandftb,dfti_conjugate_even_storage,dfti_complex_complex)
statdftb=dftisetvalue(plandftb,dfti_placement,dfti_not_inplace)
statdftb=dftisetvalue(plandftb,dfti_backward_scale,1.d0/dble(nx*ny*nz))
statdftb=dftisetvalue(plandftb,dfti_input_strides,(/0,1,nx/2+1,(nx/2+1)*ny/))
statdftb=dftisetvalue(plandftb,dfti_output_strides,(/0,1,nx,nx*ny/))
statdftb=dfticommitdescriptor(plandftb)
```

Use plandftf for the forward Fourier transform and plandftb for the backward Fourier transform. What I'm trying to say is: put this much in the manual. This kind of problem does not occur in the Perm (or Pack) format. By the way, if you want to use the Perm format:
```
type(dfti_descriptor),pointer,save,private :: plandft
integer,save,private :: statdft
statdft=dfticreatedescriptor(plandft,dfti_double,dfti_real,2,(/nx,ny/))
statdft=dftisetvalue(plandft,dfti_backward_scale,1.d0/dble(nx*ny))
statdft=dftisetvalue(plandft,dfti_real_storage,dfti_real_complex)
statdft=dftisetvalue(plandft,dfti_packed_format,dfti_perm_format)
statdft=dfticommitdescriptor(plandft)
```

If you really want to use the Perm format for a 3D Fourier transform, you can use the FFT package distributed on Oura's site at Kyoto University (it also supports the DCT, so you can use it if you need a DCT; I mean, why doesn't Intel support the DCT while charging a ridiculous amount of money?).
FFT using FFTW (version 3)
FFTW itself can be installed by unpacking the archive, moving into the created directory, and running (on Ubuntu it can also be installed from the Ubuntu Software Center):

```
./configure
make
make install
```

If you plan to parallelize with OpenMP or MPI, the first line should be
```
./configure --enable-threads --enable-mpi
```

The module is somewhat cumbersome, and the array to be Fourier transformed must be specified in advance.
```
module fftw_fft
  use,intrinsic :: iso_c_binding
!
  implicit none
  include '/usr/local/include/fftw3.f03'
  integer,parameter :: nx=512,ny=512
  integer,parameter :: nxy=nx*ny
  type(c_ptr),save,private :: plandftf,plandftb
  real(8),private :: xlxy
  complex(8),private :: zf(nx,ny)
!
!
contains
!
!
  subroutine fftw_fft_ini
    plandftf=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_forward,fftw_measure)
    plandftb=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_backward,fftw_measure)
    xlxy=dble(nxy)
    return
  end subroutine fftw_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    call fftw_execute_dft(plandftf,zf,zf)
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    call fftw_execute_dft(plandftb,zf,zf)
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)/xlxy
  end subroutine backwardfft
end module fftw_fft
```

For parallelization with OpenMP, the number of threads must be specified in the program. To use the value of the environment variable OMP_NUM_THREADS as the number of threads:
```
module fftw_fft
  use,intrinsic :: iso_c_binding
  use omp_lib
!
  implicit none
  include '/usr/local/include/fftw3.f03'
  integer,parameter :: nx=512,ny=512
  integer,parameter :: nxy=nx*ny
  type(c_ptr),save,private :: plandftf,plandftb
  integer,private :: iret,nprocs
  real(8),private :: xlxy
  complex(8),private :: zf(nx,ny)
!
!
contains
!
!
  subroutine fftw_fft_ini
    !$OMP PARALLEL
    nprocs=omp_get_max_threads()
    !$OMP END PARALLEL
!
    call dfftw_init_threads(iret)
    call dfftw_plan_with_nthreads(nprocs)
!
    plandftf=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_forward,fftw_measure)
    plandftb=fftw_plan_dft_2d(nx,ny,zf,zf,fftw_backward,fftw_measure)
    xlxy=dble(nxy)
    return
  end subroutine fftw_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    !$OMP WORKSHARE
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    !$OMP END WORKSHARE
    call fftw_execute_dft(plandftf,zf,zf)
    !$OMP WORKSHARE
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)
    !$OMP END WORKSHARE
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    !$OMP WORKSHARE
    zf(1:nx,1:ny)=zfunc(1:nx,1:ny)
    !$OMP END WORKSHARE
    call fftw_execute_dft(plandftb,zf,zf)
    !$OMP WORKSHARE
    zfunc(1:nx,1:ny)=zf(1:nx,1:ny)/xlxy
    !$OMP END WORKSHARE
  end subroutine backwardfft
end module fftw_fft
```

Also, if you only Fourier transform in the x direction and parallelize over the loop in y instead, you do not need to declare the number of threads to FFTW:
```
module fftw_fft
  use,intrinsic :: iso_c_binding
  use omp_lib
!
  implicit none
  include '/usr/local/include/fftw3.f03'
  integer,parameter :: nx=512,ny=512
  type(c_ptr),save,private :: plandftf,plandftb
  real(8),private :: xlx
  complex(8),private :: zf(nx)
!
!
contains
!
!
  subroutine fftw_fft_ini
    plandftf=fftw_plan_dft_1d(nx,zf,zf,fftw_forward,fftw_measure)
    plandftb=fftw_plan_dft_1d(nx,zf,zf,fftw_backward,fftw_measure)
    xlx=dble(nx)
    return
  end subroutine fftw_fft_ini
!
!
  subroutine forwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    integer :: iy
    !$OMP PARALLEL private(zf)
    !$OMP DO
    do iy=1,ny
      zf(1:nx)=zfunc(1:nx,iy)
      call fftw_execute_dft(plandftf,zf,zf)
      zfunc(1:nx,iy)=zf(1:nx)
    enddo
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine forwardfft
!
!
  subroutine backwardfft(zfunc)
    complex(8),intent(inout) :: zfunc(nx,ny)
    integer :: iy
    !$OMP PARALLEL private(zf)
    !$OMP DO
    do iy=1,ny
      zf(1:nx)=zfunc(1:nx,iy)
      call fftw_execute_dft(plandftb,zf,zf)
      zfunc(1:nx,iy)=zf(1:nx)/xlx
    enddo
    !$OMP END DO
    !$OMP END PARALLEL
  end subroutine backwardfft
end module fftw_fft
```

Compile without OpenMP (the first example):
```
ifort hoge.f90 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -L /usr/local/lib -lfftw3 -lm -mkl -lpthread (additional options) -module mod -o hoge.exe
```

Compile with OpenMP (the latter examples):
```
ifort hoge.f90 -L /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ -L /usr/local/lib -lfftw3_threads -lfftw3 -lm -mkl -lpthread -qopenmp (additional options) -module mod -o hoge.exe
```
Let's compare the speed of some simple calculations (all were compared with the -O0 option, so the speed difference may be reduced with the -O3 option).
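A minimal sketch of the kind of timing loop presumably used for these comparisons (the loop count and the expression being timed are illustrative):

```
program timing
  implicit none
  integer(8) :: i,n,c0,c1,crate
  real(8) :: s
  complex(8) :: f
  n=100000000_8
  f=dcmplx(0.3d0,0.4d0)
  s=0.d0
  call system_clock(c0,crate)
  do i=1,n
    s=s+cdabs(f)**2        ! expression being timed
  enddo
  call system_clock(c1)
  write(*,*) 'result',s,'elapsed seconds',dble(c1-c0)/dble(crate)
end program timing
```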
absolute square of a complex number (about 10000000000 evaluations on a single core)
```
cdabs(f)**2
```

average 27.5 seconds
```
dreal(dconjg(f)*f)
```

average 9.8 seconds
```
dreal(f)**2+dimag(f)**2
```

average 10.5 seconds

I thought the third one would be the fastest, but surprisingly the second one is faster. The difference is small, though.
argument of complex numbers
```
datan2(dimag(f),dreal(f))
```

average 37.8 seconds
```
dimag(cdlog(f))
```

average 1 min 45.7 seconds
exponential of an imaginary number
```
dcmplx(dcos(f),dsin(f))
```

average 1 min 57.3 seconds
```
cdexp(dcmplx(0.d0,f))
```

average 1 min 58.2 seconds
Gaussian random number generation (using Intel MKL; generating 100000 random numbers, repeated 100000 times)
```
vdrnggaussian(vsl_rng_method_gaussian_icdf,stream,100000,f,0.d0,1.d0)
```

average 1 min 3.3 seconds
```
vdrnggaussian(vsl_rng_method_gaussian_boxmuller,stream,100000,f,0.d0,1.d0)
```

average 2 min 44.7 seconds
```
vdrnggaussian(vsl_rng_method_gaussian_boxmuller2,stream,100000,f,0.d0,1.d0)
```

average 1 min 47.1 seconds