2011-03-09

Segmentation faults occur when I run a parallel program with Open MPI

In my previous post I needed to distribute the data of a PGM file among 10 computers. With help from Jonathan Dursi and Shawn Chin, I have integrated the code. I can compile my program, but it fails with a segmentation fault. I ran it as follows, but nothing happens:

mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

The result is:

[ubuntu:04803] *** Process received signal *** 
[ubuntu:04803] Signal: Segmentation fault (11) 
[ubuntu:04803] Signal code: Address not mapped (1) 
[ubuntu:04803] Failing at address: 0x7548d0c 
[ubuntu:04803] [ 0] [0x86b410] 
[ubuntu:04803] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x186b00] 
[ubuntu:04803] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04803] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x141bd6] 
[ubuntu:04803] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04803] *** End of error message *** 
-------------------------------------------------------------------------- 
mpirun noticed that process rank 1 with PID 4803 on node ubuntu exited on signal 11 (Segmentation fault). 
-------------------------------------------------------------------------- 

I then tried running it under valgrind to debug the program, and output.pgm was generated:

valgrind mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

The result is:

==4632== Memcheck, a memory error detector 
==4632== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al. 
==4632== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info 
==4632== Command: mpirun -np 10 ./exmpi_2 2.pgm 10.pgm 
==4632== 
==4632== Syscall param sched_setaffinity(mask) points to unaddressable byte(s) 
==4632== at 0x4215D37: syscall (syscall.S:31) 
==4632== by 0x402B335: opal_paffinity_linux_plpa_api_probe_init (plpa_api_probe.c:56) 
==4632== by 0x402B7CC: opal_paffinity_linux_plpa_init (plpa_runtime.c:37) 
==4632== by 0x402B93C: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:494) 
==4632== by 0x402B180: linux_module_init (paffinity_linux_module.c:119) 
==4632== by 0x40BE2C3: opal_paffinity_base_select (paffinity_base_select.c:64) 
==4632== by 0x40927AC: opal_init (opal_init.c:295) 
==4632== by 0x4046767: orte_init (orte_init.c:76) 
==4632== by 0x804A82E: orterun (orterun.c:540) 
==4632== by 0x804A3EE: main (main.c:13) 
==4632== Address 0x0 is not stack'd, malloc'd or (recently) free'd 
==4632== 
[ubuntu:04638] *** Process received signal *** 
[ubuntu:04639] *** Process received signal *** 
[ubuntu:04639] Signal: Segmentation fault (11) 
[ubuntu:04639] Signal code: Address not mapped (1) 
[ubuntu:04639] Failing at address: 0x7548d0c 
[ubuntu:04639] [ 0] [0xc50410] 
[ubuntu:04639] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0xde4b00] 
[ubuntu:04639] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04639] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xd9fbd6] 
[ubuntu:04639] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04639] *** End of error message *** 
[ubuntu:04640] *** Process received signal *** 
[ubuntu:04640] Signal: Segmentation fault (11) 
[ubuntu:04640] Signal code: Address not mapped (1) 
[ubuntu:04640] Failing at address: 0x7548d0c 
[ubuntu:04640] [ 0] [0xdad410] 
[ubuntu:04640] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0xe76b00] 
[ubuntu:04640] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04640] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xe31bd6] 
[ubuntu:04640] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04640] *** End of error message *** 
[ubuntu:04641] *** Process received signal *** 
[ubuntu:04641] Signal: Segmentation fault (11) 
[ubuntu:04641] Signal code: Address not mapped (1) 
[ubuntu:04641] Failing at address: 0x7548d0c 
[ubuntu:04641] [ 0] [0xe97410] 
[ubuntu:04641] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1e8b00] 
[ubuntu:04641] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04641] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1a3bd6] 
[ubuntu:04641] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04641] *** End of error message *** 
[ubuntu:04642] *** Process received signal *** 
[ubuntu:04642] Signal: Segmentation fault (11) 
[ubuntu:04642] Signal code: Address not mapped (1) 
[ubuntu:04642] Failing at address: 0x7548d0c 
[ubuntu:04642] [ 0] [0x92d410] 
[ubuntu:04642] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x216b00] 
[ubuntu:04642] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04642] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1d1bd6] 
[ubuntu:04642] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04642] *** End of error message *** 
[ubuntu:04643] *** Process received signal *** 
[ubuntu:04643] Signal: Segmentation fault (11) 
[ubuntu:04643] Signal code: Address not mapped (1) 
[ubuntu:04643] Failing at address: 0x7548d0c 
[ubuntu:04643] [ 0] [0x8f4410] 
[ubuntu:04643] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x16bb00] 
[ubuntu:04643] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04643] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x126bd6] 
[ubuntu:04643] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04643] *** End of error message *** 
[ubuntu:04638] Signal: Segmentation fault (11) 
[ubuntu:04638] Signal code: Address not mapped (1) 
[ubuntu:04638] Failing at address: 0x7548d0c 
[ubuntu:04638] [ 0] [0x4f6410] 
[ubuntu:04638] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x222b00] 
[ubuntu:04638] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04638] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1ddbd6] 
[ubuntu:04638] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04638] *** End of error message *** 
[ubuntu:04644] *** Process received signal *** 
[ubuntu:04644] Signal: Segmentation fault (11) 
[ubuntu:04644] Signal code: Address not mapped (1) 
[ubuntu:04644] Failing at address: 0x7548d0c 
[ubuntu:04644] [ 0] [0x61f410] 
[ubuntu:04644] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1a3b00] 
[ubuntu:04644] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04644] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x15ebd6] 
[ubuntu:04644] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04644] *** End of error message *** 
[ubuntu:04645] *** Process received signal *** 
[ubuntu:04645] Signal: Segmentation fault (11) 
[ubuntu:04645] Signal code: Address not mapped (1) 
[ubuntu:04645] Failing at address: 0x7548d0c 
[ubuntu:04645] [ 0] [0x7a3410] 
[ubuntu:04645] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1d5b00] 
[ubuntu:04645] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04645] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x190bd6] 
[ubuntu:04645] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04645] *** End of error message *** 
[ubuntu:04647] *** Process received signal *** 
[ubuntu:04647] Signal: Segmentation fault (11) 
[ubuntu:04647] Signal code: Address not mapped (1) 
[ubuntu:04647] Failing at address: 0x7548d0c 
[ubuntu:04647] [ 0] [0xf54410] 
[ubuntu:04647] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x2bab00] 
[ubuntu:04647] [ 2] ./exmpi_2(main+0x78e) [0x80492c2] 
[ubuntu:04647] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x275bd6] 
[ubuntu:04647] [ 4] ./exmpi_2() [0x8048aa1] 
[ubuntu:04647] *** End of error message *** 
-------------------------------------------------------------------------- 
mpirun noticed that process rank 2 with PID 4639 on node ubuntu exited on signal 11 (Segmentation fault). 
-------------------------------------------------------------------------- 
6 total processes killed (some possibly by mpirun during cleanup) 
==4632== 
==4632== HEAP SUMMARY: 
==4632==  in use at exit: 158,751 bytes in 1,635 blocks 
==4632== total heap usage: 10,443 allocs, 8,808 frees, 15,854,537 bytes allocated 
==4632== 
==4632== LEAK SUMMARY: 
==4632== definitely lost: 81,655 bytes in 112 blocks 
==4632== indirectly lost: 5,108 bytes in 91 blocks 
==4632==  possibly lost: 1,043 bytes in 17 blocks 
==4632== still reachable: 70,945 bytes in 1,415 blocks 
==4632==   suppressed: 0 bytes in 0 blocks 
==4632== Rerun with --leak-check=full to see details of leaked memory 
==4632== 
==4632== For counts of detected and suppressed errors, rerun with: -v 
==4632== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 96 from 9) 

Could someone help me solve this problem? This is my source code:

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include "mpi.h" 
#include <syscall.h> 

#define SIZE_X 640 
#define SIZE_Y 480 




int main(int argc, char **argv) 
{ 
FILE *FR,*FW; 
int ierr; 
int rank, size; 
int ncells; 
int greys[SIZE_X][SIZE_Y]; 
int rows,cols, maxval; 

int mystart, myend, myncells; 
const int IONODE=0; 
int *disps, *counts, *mydata; 
int *data; 
int i,j,temp1; 
char dummy[50]=""; 





ierr = MPI_Init(&argc, &argv); 
if (argc != 3) { 
    fprintf(stderr,"Usage: %s infile outfile\n",argv[0]); 
    fprintf(stderr,"outputs the negative of the input file.\n"); 
    return -1; 
}    

ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank); 
ierr = MPI_Comm_size(MPI_COMM_WORLD, &size); 
if (ierr) { 
    fprintf(stderr,"Catastrophic MPI problem; exiting\n"); 
    MPI_Abort(MPI_COMM_WORLD,1); 
} 

    if (rank == IONODE) { 
      //if (read_pgm(argv[1], &greys, &rows, &cols, &maxval)) { 
      // fprintf(stderr,"Could not read file; exiting\n"); 
       // MPI_Abort(MPI_COMM_WORLD,2); 

     rows=SIZE_X; 
     cols=SIZE_Y; 
     maxval=255; 
     FR=fopen(argv[1], "r+"); 

     fgets(dummy,50,FR); 
     do{ fgets(dummy,50,FR); } while(dummy[0]=='#'); 
     fgets(dummy,50,FR); 

    for (j = 0; j <cols; j++) 
    { 
     for (i = 0; i <rows; i++) 
     { 
      fscanf(FR,"%d",&temp1); 
     greys[i][j] = temp1; 
     } 
    } 
} 

    ncells = rows*cols; 
    disps = (int *)malloc(size * sizeof(int)); 
    counts= (int *)malloc(size * sizeof(int)); 
    data = &(greys[0][0]); /* we know all the data is contiguous */ 

/* everyone calculate their number of cells */ 
ierr = MPI_Bcast(&ncells, 1, MPI_INT, IONODE, MPI_COMM_WORLD); 
myncells = ncells/size; 
mystart = rank*myncells; 
myend = mystart + myncells - 1; 
if (rank == size-1) myend = ncells-1; 
myncells = (myend-mystart)+1; 
mydata = (int *)malloc(myncells * sizeof(int)); 

/* assemble the list of counts. Might not be equal if don't divide evenly. */ 
ierr = MPI_Gather(&myncells, 1, MPI_INT, counts, 1, MPI_INT, IONODE, MPI_COMM_WORLD); 
if (rank == IONODE) { 
    disps[0] = 0; 
    for (i=1; i<size; i++) { 
     disps[i] = disps[i-1] + counts[i-1]; 
    } 
} 

/* scatter the data */ 
ierr = MPI_Scatterv(data, counts, disps, MPI_INT, mydata, myncells, MPI_INT, IONODE, MPI_COMM_WORLD); 

/* everyone has to know maxval */ 
ierr = MPI_Bcast(&maxval, 1, MPI_INT, IONODE, MPI_COMM_WORLD); 

for (i=0; i<myncells; i++) 
    mydata[i] = maxval-mydata[i]; 

/* Gather the data */ 
ierr = MPI_Gatherv(mydata, myncells, MPI_INT, data, counts, disps, MPI_INT, IONODE, MPI_COMM_WORLD); 

if (rank == IONODE) 
{ 
//  write_pgm(argv[2], greys, rows, cols, maxval); 
    FW=fopen(argv[2], "w"); 
    fprintf(FW,"P2\n%d %d\n255\n",rows,cols);  
    for(j=0;j<cols;j++) 
    for(i=0;i<rows;i++) 
    fprintf(FW,"%d ", greys[i][j]); 
} 

free(mydata); 
if (rank == IONODE) { 
    free(counts); 
    free(disps); 
    //free(&(greys[0][0])); 
    //free(greys); 

} 
fclose(FR); 
fclose(FW); 
MPI_Finalize(); 
return 0; 
} 

This is the input image: http://orion.math.iastate.edu/burkardt/data/pgm/balloons.pgm

Which line is causing the segfault? – suszterpatt

Answer

Congratulations; the code very nearly worked perfectly. It died in almost the last lines of code.

The problem would have been a little clearer with valgrind, but running valgrind with MPI, or with anything that involves a launcher program, is a little trickier. Instead of:

valgrind mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

which runs valgrind on mpirun itself, which you don't really care about, you want to do

mpirun -np 10 valgrind ./exmpi_2 balloons.pgm output.pgm

That is, you want to launch 10 valgrinds, each running one process' worth of exmpi_2. If you do that (and you have compiled with -g), you will find, towards the end, valgrind output like the following:

==6303== Access not within mapped region at address 0x1 
==6303== at 0x387FA60C17: fclose@@GLIBC_2.2.5 (in /lib64/libc-2.5.so) 
==6303== by 0x401222: main (pgm.c:124) 

... and that's all there is to it; you have every process calling fclose(), when only one process has a handle from fopen() in the first place. On every other rank FR and FW are uninitialized pointers, so fclose() is handed garbage. Simply replacing

fclose(FR); 
fclose(FW); 

with

if (rank == IONODE) { 
    fclose(FR); 
    fclose(FW); 
} 

seems to work for me.
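A more defensive variant of the same fix, offered only as a sketch under assumptions that are not in the original program: if every FILE * is initialized to NULL, all closes can go through a small helper that skips handles which were never opened. The helper name close_if_open is hypothetical. On ranks that never call fopen() the pointers stay NULL, so the helper does nothing instead of passing an invalid pointer to fclose().

```c
#include <stdio.h>

/* Hypothetical helper (not part of the original program): close *fp only
 * if it refers to an open stream, then reset it to NULL so that a second
 * call is harmless.
 * Returns 1 if a stream was closed, 0 if there was nothing to close,
 * and -1 if fclose() reported an error. */
int close_if_open(FILE **fp)
{
    int rc;

    if (fp == NULL || *fp == NULL)
        return 0;               /* never opened on this rank */

    rc = fclose(*fp);
    *fp = NULL;                 /* guard against a double close */
    return (rc == 0) ? 1 : -1;
}
```

In the program from the question this would mean declaring `FILE *FR = NULL, *FW = NULL;` and calling `close_if_open(&FR); close_if_open(&FW);` on every rank before MPI_Finalize(). It only helps if the pointers really are initialized to NULL; an uninitialized FILE * would still crash.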

After the scatter operation, the data is in a 1D array. I am trying to do edge detection using the Laplacian operator, but it needs 2D data. Can the data be 2D after the scatter? I am having trouble processing some images with 1D data after the scatter. – arep
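One possible way to approach this, sketched under assumptions: the scattered chunk is still a contiguous piece of the original 2D array, so it can be indexed as 2D as long as each rank receives whole rows. With greys[SIZE_X][SIZE_Y] every row of the flat buffer has SIZE_Y = 480 elements, and with 10 ranks each chunk of 640*480/10 = 30720 values is exactly 64 such rows. The function names idx2d and laplacian_interior are hypothetical, and the sketch only handles the interior of the local block. The rows on the boundary between two ranks need their neighbour's edge row (a halo exchange), which is not shown here.

```c
#include <stddef.h>

/* Hypothetical helper: row-major index into a flat buffer whose rows
 * are row_len elements long. */
size_t idx2d(size_t r, size_t c, size_t row_len)
{
    return r * row_len + c;
}

/* Hypothetical sketch: 5-point Laplacian over the interior of an
 * nrows x row_len block stored flat in `in`. Border cells of `out`
 * are set to 0 because their neighbours are not all available locally. */
void laplacian_interior(const int *in, int *out, size_t nrows, size_t row_len)
{
    size_t r, c;

    /* Zero everything first so the border has a defined value. */
    for (r = 0; r < nrows; r++)
        for (c = 0; c < row_len; c++)
            out[idx2d(r, c, row_len)] = 0;

    /* Interior cells: sum of the four neighbours minus 4 times the centre. */
    for (r = 1; r + 1 < nrows; r++) {
        for (c = 1; c + 1 < row_len; c++) {
            out[idx2d(r, c, row_len)] =
                  in[idx2d(r - 1, c, row_len)]
                + in[idx2d(r + 1, c, row_len)]
                + in[idx2d(r, c - 1, row_len)]
                + in[idx2d(r, c + 1, row_len)]
                - 4 * in[idx2d(r, c, row_len)];
        }
    }
}
```

On each rank this could be called as `laplacian_interior(mydata, result, myncells / SIZE_Y, SIZE_Y)`, where result is a second buffer of myncells ints. This assumes myncells is a multiple of SIZE_Y, which holds for 10 ranks but not for every process count.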
