How to run CASToR in MPI mode?

Dear all,
I am a newbie to CASToR.
I successfully built castor-3.1.1 with Intel MPI and the Intel compiler mpiicpc.
I also ran the benchmark with the provided .sh script on a single server in OpenMP mode.
My question is: how do I run the benchmark with the mpirun command? Does anyone have an MPI-based parallel run example or shell script they could share?
Thanks a lot!

Dear Joao,
The benchmark script just calls castor-recon and then compares the reconstructed image with the reference one. So you could simply prepend the MPI launch command (mpirun/mpiexec) to the castor-recon call in order to test it.
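For example, something like this (adapting the number of processes to your machine; the options are just whatever the benchmark normally passes to castor-recon):

mpirun -np 4 castor-recon [usual benchmark options]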

Let me know if this works for you.

Best,
Thibaut

Hi Tmerlin:

Thanks for your nice reply. I tried adding mpirun before castor-recon. The command I used is:

mpirun -np 4 castor-recon -vb 2 -df benchmark_pet_list-mode_tof.cdh -norm benchmark_pet_list-mode_norm.cdh -fout benchmark_pet_list-mode_challenger -oit -1 -it 2:28 -dim 140,70,63 -fov 560.,280.,252. -off 0.,16.,0. -opti MLEM -proj joseph -conv gaussian,4.,4.,3.5::psf -conv gaussian,6.,6.,3.5::post -th 0 -slice-out 3 -flip-out Y

And I encountered an error as shown below:

→ Image dimensions: [140;70;63] voxels of [Abort(201967372) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Recv: Invalid argument, error stack:

PMPI_Recv(173): MPI_Recv(buf=0x1975638, count=1, MPI_LONG, src=0, tag=0, MPI_COMM_WORLD, status=(nil)) failed

PMPI_Recv(84).: Null pointer in parameter status

I am very eager for your help!

Thanks!

Hi,
This looks primarily like an Intel MPI error, so I am really not sure about the origin of the issue. Do you also get an error when setting -np to 1, just for testing?

Best,
Thibaut

Hi Tmerlin:

Thanks for your nice reply!

I tried ‘mpirun -np 1 castor-recon’ and it runs normally. But when I change -np to another integer (such as 4), the program reports the errors above.

I look forward to your further help.

Thanks!

Hi,
The error occurs early in the initialization phase according to your log, and MPI is not yet used at that point. Does the error message always appear at this stage when using different -np values?
If not done already, I would first check that CASToR has been correctly compiled with MPI support (if you run castor-recon -help-comp, the output should state “Compiled with MPI”).

Best,
Thibaut

Hi Joao,

Curiously, I had a similar problem just a few days ago, with the following error:

oImageDimensionsAndQuantification::Initialize() -> Initialize image dimensions, basis functions and quantification
  --> Image dimensions: [100;100;50] voxels of [4.0000001e-01;4.0000001e-01;4.0000001e-01] mm3
  --> FOV size: [4.0000000e+01;4.0000000e+01;2.0000000e+01] mm3
  --> Number of parallel threads: 4
sRandomNumberGenerator::Seed for rank 0 is 2213572483
sRandomNumberGenerator::Seed for rank 1 is 2182813001
sRandomNumberGenerator::Seed for rank 2 is 3037910000
sRandomNumberGenerator::Seed for rank 3 is 4079532605

job aborted:
[ranks] message

[0] terminated

[1] fatal error
Fatal error in MPI_Recv: Invalid argument, error stack:
MPI_Recv(buf=0x00000207FBA98778, count=1, MPI_LONG, src=0, tag=0, MPI_COMM_WORLD, status=0x0000000000000000) failed
Null pointer in parameter status

[2] fatal error
Fatal error in MPI_Recv: Invalid argument, error stack:
MPI_Recv(buf=0x000001AC57BE80B8, count=1, MPI_LONG, src=0, tag=0, MPI_COMM_WORLD, status=0x0000000000000000) failed
Null pointer in parameter status

[3] fatal error
Fatal error in MPI_Recv: Invalid argument, error stack:
MPI_Recv(buf=0x000001C9C7608A98, count=1, MPI_LONG, src=0, tag=0, MPI_COMM_WORLD, status=0x0000000000000000) failed
Null pointer in parameter status

I tried debugging castor-recon.cc with some cout statements to see where the MPI_Recv error was coming from. It appeared to come from the initialization of the random number generator; code snippet:

  // ----------------------------------------------------------------------------------------
  // Random Number Generator initialization: (we first require to know the number of threads to use from p_ImageDimensionsAndQuantification)
  // ----------------------------------------------------------------------------------------
  cout << "Test 1: mpi_rank = " << mpi_rank << endl; 
  if (verbose_general>=5) Cout("----- Random number generator initialization ... -----" << endl);
  
  sRandomNumberGenerator* p_RandomNumberGenerator = sRandomNumberGenerator::GetInstance(); 
  p_RandomNumberGenerator->SetVerbose(verbose_general);
  // Use a user-provided seed to initialize the RNG if one has been provided. Use random number otherwise.
  cout << "Test 2: mpi_rank = " << mpi_rank << endl;
  if (random_generator_seed>=0) p_RandomNumberGenerator->Initialize(random_generator_seed, p_ImageDimensionsAndQuantification->GetNbThreadsMax(), nb_extra_random_generators);
  else p_RandomNumberGenerator->Initialize(p_ImageDimensionsAndQuantification->GetNbThreadsMax(), nb_extra_random_generators);
  cout << "Test 3: mpi_rank = " << mpi_rank << endl;

  if (verbose_general >=5) Cout("----- Random number generator initialization OK -----" << endl);

Only Test 1 and Test 2 were printed, so the error comes from the Initialize function of p_RandomNumberGenerator (the seedless overload, since random_generator_seed = -1).

There, I found that the error comes from this part of the code:

  // if more than one MPI instance, generate seeds for each instance
  // and dispatch them, otherwise keep the initial seed
  if (mpi_size_temp>1)
  {
    if (mpi_rank_temp==0)
    {
      Engine mpi_generator(m_seed);
      for (int p=0; p<mpi_size_temp; p++)
      {
        m_seed = mpi_generator();
        Cout("sRandomNumberGenerator::Seed for rank " << p << " is " << m_seed << endl);
        if (p==0)
          temp_seed = m_seed;
        else
          MPI_Send(&m_seed, 1, MPI_LONG, p, 0, MPI_COMM_WORLD);
      }
      m_seed = temp_seed;
    }
    else
    {
      MPI_Status *status = NULL;
      MPI_Recv(&m_seed, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, status);
    }
    // wait for all the processes to have their seeds
    MPI_Barrier(MPI_COMM_WORLD);
  }

In particular, when there is more than one process, mpi_rank_temp is 1, 2, or 3 on the non-root ranks, which enter the else branch:

  MPI_Status *status = NULL;
  MPI_Recv(&m_seed, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, status);

which gives the “Null pointer in parameter status” error, since MPI_Recv does not accept a null MPI_Status pointer. I changed it to:

  MPI_Status status;
  MPI_Recv(&m_seed, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, &status);

and now it works fine!
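
An equivalent and arguably more idiomatic fix, when the status information is not needed, is to pass MPI_STATUS_IGNORE instead of a status object; the error message itself shows that Intel MPI rejects a plain null pointer there. Below is a minimal standalone sketch of the same rank-0 seed-dispatch pattern, just as an illustration (it is not the actual CASToR code; the Engine type is replaced by std::mt19937):

// seed_dispatch.cc - standalone sketch, not the actual CASToR code
// Build:  mpiicpc seed_dispatch.cc -o seed_dispatch
// Run:    mpirun -np 4 ./seed_dispatch
#include <mpi.h>
#include <random>
#include <iostream>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int mpi_rank = 0, mpi_size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

  long seed = 12345; // initial seed, kept as-is with a single process
  if (mpi_size > 1)
  {
    if (mpi_rank == 0)
    {
      // rank 0 draws one seed per rank and sends each other rank its own
      std::mt19937 generator((unsigned int)seed);
      long root_seed = 0;
      for (int p = 0; p < mpi_size; p++)
      {
        seed = (long)generator();
        std::cout << "Seed for rank " << p << " is " << seed << std::endl;
        if (p == 0) root_seed = seed;
        else MPI_Send(&seed, 1, MPI_LONG, p, 0, MPI_COMM_WORLD);
      }
      seed = root_seed;
    }
    else
    {
      // MPI_STATUS_IGNORE tells MPI_Recv not to write any status information,
      // which is the standard replacement for an (invalid) null MPI_Status*
      MPI_Recv(&seed, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    // wait for all processes to have their seeds
    MPI_Barrier(MPI_COMM_WORLD);
  }
  std::cout << "Rank " << mpi_rank << " uses seed " << seed << std::endl;

  MPI_Finalize();
  return 0;
}

Either fix works; MPI_STATUS_IGNORE just makes explicit that the status is intentionally discarded.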


I don’t know if your error comes from the same place, since it is not exactly the same, but I hope it helps!

Best,
Miguel
