1/** @file grid.h
2 */
3/**
4# Grids on GPUs
5
6The files in this directory implement Cartesian and Multigrid grids on
7[Graphics Processing
8Units](https://en.wikipedia.org/wiki/Graphics_processing_unit)
9(GPUs). The ultimate goal is to allow running any [Basilisk
10solver](/src/README#solvers) on GPUs without any modification to the
11original source code.
12
13To do so, the [Basilisk preprocessor](/src/ast/README) automatically
14generates "computation kernels" for each [loop
15iterator](/Basilisk%20C#iterators). These kernels are then dynamically
16compiled (at runtime) by the [OpenGL Shading
17Language](https://en.wikipedia.org/wiki/OpenGL_Shading_Language)
18(GLSL) compiler which is part of the (OpenGL) graphics card driver. If
19compilation is successful, the corresponding loop is performed on the
20GPU, otherwise the CPU is used. If this hybrid GPU/CPU mode of
21operation is used, synchronisation between the GPU and CPU memory is
22necessary and is done automatically.
23
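For example (a sketch reusing the small test program discussed further
below), a loop such as

~~~literatec
scalar s[];
double k = 2.;
foreach()
  s[] = cos(k*x)*sin(k*y);
~~~

is turned into a GLSL compute kernel which is compiled by the graphics
driver at runtime and, if compilation succeeds, executed on the GPU.
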
24OpenGL is an open standard (unlike
25e.g. [CUDA](https://en.wikipedia.org/wiki/CUDA)) and is widely
26supported by graphics cards (with the notable exception of Apple
27graphics cards and some high-end "professional" Nvidia cards).
28
29## Running on GPUs
30
31As described above, from a Basilisk perspective GPUs are just another
32type of grid. Selecting a "GPU grid" can simply be done using either
33
34~~~literatec
35#include "grid/gpu/multigrid.h"
36~~~
37
38in the source code, or using the `-grid` command line option of
39[qcc](/src/qcc.c) like this
40
41~~~bash
42qcc -autolink -Wall -O2 -grid=gpu/multigrid code.c -o code -lm
43~~~
44
45The standard Basilisk [Makefile](/Tutorial#using-makefiles) also
46includes the handy recipe
47
48~~~bash
49make code.gpu.tst
50~~~
51
52which will compile and run `code.c` using the `gpu/multigrid` grid.
53
54Note that for all this to work properly you first need to
55[install](#installation) the Basilisk GPU libraries.
56
57## Installation
58
59Basilisk uses the [GLFW](https://www.glfw.org/) library to configure
60and access the graphics card and [OpenGL](https://www.opengl.org/)
61(version >= 4.3) for the rest. These libraries and the associated
62Basilisk libraries can be easily installed on Debian-like systems
63using
64
65~~~bash
66sudo apt install libglfw3-dev
67cd $BASILISK/grid/gpu
68make
69~~~
70
71Note that you will also need the appropriate graphics card drivers
72(often proprietary for Nvidia). Note also that (reasonably high-end)
73laptop computers often have two graphics cards: a low-power, slow one
74and a high-power, fast one. To check which one you are currently using
75you can use something like
76
77~~~bash
78sudo apt install mesa-utils
79glxinfo -B
80~~~
81
82On my Dell XPS laptop I can switch to the (proprietary driver of the)
83fast Nvidia graphics card using
84
85~~~bash
86__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo -B
87~~~
88
89## Tests
90
91There are several [test cases for GPUs](/src/test/README#gpu-tests)
92you can try. For example
93
94~~~bash
95cd $BASILISK/test
96CFLAGS=-DPRINTNSHADERS make gpu.gpu.tst
97~~~
98
99If this worked, you can then try a more interesting example
100
101~~~bash
102CFLAGS='-DSHOW' make bump2D-gpu.tst
103__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia ./bump2D-gpu/bump2D-gpu 10
104~~~
105
106and also
107
108~~~bash
109cd $BASILISK/examples
110CFLAGS='-DSHOW -DBENCHMARK' make turbulence.gpu.tst
111__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia ./turbulence.gpu/turbulence.gpu 1024
112~~~
113
114## Writing code compatible with GPUs
115
116GPUs are fast compared to CPUs because they use specialised hardware
117which relies on highly-parallel (tens of thousands of execution
118threads) asynchronous accesses to fast video memory channels. This
119imposes strong constraints on programs which can run efficiently on
120these systems, in particular regarding memory allocation and
121accesses. These constraints are reflected in the programming languages
122usable on GPUs, for example the [OpenGL Shading
123Language](https://www.khronos.org/opengl/wiki/Core_Language_(GLSL))
124(GLSL) which underlies the GPU grid in Basilisk.
125
126GLSL is mostly a subset of C99 and the good news is that this subset
127happens to be what is used within most `foreach` loops in Basilisk
128(this is not a coincidence...). Thus, in many cases, simple and
129efficient Basilisk code will also run transparently and efficiently on
130GPUs.
131
132### GPU/CPU hybrid mode
133
134There are obvious cases where `foreach` loops will not run on GPUs (see
135the [next section](#what-cannot-be-done-on-gpus)). In these cases,
136Basilisk will automatically switch to running the loop on the CPU and
137will synchronize the CPU and GPU memories. Note that this also means
138that the memory (i.e. scalar fields etc.) for a program is always
139allocated twice: once on the CPU and once on the GPU.
140
141As an example, consider the following simple code
142
143~~~literatec
144int main()
145{
146 init_grid (16);
147 scalar s[];
148 double k = 2.;
149 foreach()
150 s[] = cos(k*x)*sin(k*y);
151 foreach()
152 printf ("%g %g %g\n", x, y, s[]);
153}
154~~~
155
156This can be run on the CPU using e.g.
157
158~~~bash
159make test.tst
160~~~
161
162If we now run on the GPU using
163
164~~~bash
165make test.gpu.tst
166~~~
167
168we get
169
170~~~bash
171test.gpu.c:9: GLSL: error: unknown function 'printf'
172test.gpu.c:8: warning: foreach() done on CPU (see GLSL errors above)
173~~~
174
175i.e. the code compiled and ran fine but, as indicated by the warning
176message, the second loop (the loop at line 8 of `test.gpu.c`,
177which includes "printf") was run on the CPU. Note that the first
178message is a "GLSL: error" but that the code still ran OK on the
179CPU and that this error was not caught during
180compilation. That's because foreach loops are compiled dynamically at
181runtime by the graphics card driver.
182
183Since GPUs have very limited access to the operating system
184(i.e. only through the OpenGL interface) we cannot expect the loop
185including "printf" (or any other output) to run on the GPU. Note also
186that the second loop should be "serial" rather than parallel (see
187[Parallel Programming](/Basilisk%20C#parallel-programming)). So we need
188to modify the code to
189
190~~~literatec
191...
192 foreach (serial)
193 printf ("%g %g %g\n", x, y, s[]);
194...
195~~~
196
197If we now recompile and run with `make test.gpu.tst`, the GLSL error
198and warnings are gone since we explicitly specified that the second
199loop should run on the CPU (and in serial mode).
200
201Another way to specify that a given loop should run on the CPU (either
202in serial or parallel mode) is to use
203
204~~~literatec
205foreach (cpu)
206 ...
207~~~
208
209Similarly one could use `foreach (gpu)` to force running on the
210GPU, in which case the GLSL warning above would become an error. This
211can be useful when debugging GPU codes, in combination with
212the `-cpu` [compilation flag](/src/qcc.c) which will force loops to
213run on the CPU by default.
214
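For example, a sketch of this debugging pattern, combining `foreach (gpu)`
with the `-cpu` flag added to the compilation command shown earlier (the
loop body reuses the example above):

~~~literatec
foreach (gpu)  // this loop must run on the GPU: a GLSL failure is now an error
  s[] = cos(k*x)*sin(k*y);
~~~

~~~bash
qcc -autolink -Wall -O2 -grid=gpu/multigrid -cpu code.c -o code -lm
~~~

With `-cpu` the other loops default to the CPU, so only the loop explicitly
marked `foreach (gpu)` should end up being compiled to GLSL.
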
215### Variable-size arrays
216
217In C99, variable-size arrays can be defined simply, for example
218
219~~~c
220void func1 (int n, double a[n]) {
221 ...
222}
223...
224{
225 int m = ...;
226 double b[m];
227 func1 (m, b);
228}
229~~~
230
231Since this relies on dynamic memory allocation on the stack, this is
232not possible in general in GLSL. The only case where this will work
233is if the size of the array can be computed "statically" i.e. at the
234time the GLSL kernel is compiled. Furthermore, the GLSL compiler is
235strict (or not very clever) and code like
236
237~~~c
238int n = 3;
239double a[n];
240~~~
241
242will fail with an error like
243
244~~~bash
245GLSL: error: array size must be a constant valued expression
246~~~
247
248To fix this one needs to write instead
249
250~~~c
251const int n = 3;
252double a[n];
253~~~
254
255Note that the size of the array must be a constant, but only at the
256time when the GLSL kernel is compiled. This allows using variable-size
257arrays also in GLSL, provided their size is constant within the
258kernel. For example the following code will work fine on the GPU, even
259if `n` changes between calls to `func2()`.
260
261~~~literatec
262void func2 (const int n) {
263 foreach() {
264 const int size = n + 1;
265 double a[size];
266 ...
267 }
268}
269~~~
270
271Finally, using variable-sized arrays as function parameters, as
272done in `func1()` above, is not allowed in GLSL. To work around this
273strong limitation, the [kernel preprocessor](/src/ast/kernels.c) will
274expand calls to functions using variable-size arrays (using the
275[macro](/src/ast/macro.h) engine). Note that this means that the
276function must respect the constraints applying to macros, in
277particular it [can return only at the end of the
278function](/src/ast/macro.h#complex-return-macros).
279
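As a sketch (the function name and body below are hypothetical), a call such
as `filter (3, w)` would be expanded inline by the kernel preprocessor
because `w` is a variable-size array parameter; note the single `return` at
the end, as required for macros:

~~~literatec
double filter (int n, double w[n])
{
  // hypothetical example: average the n values passed as a variable-size array
  double sum = 0.;
  for (int i = 0; i < n; i++)
    sum += w[i];
  return sum/n;
}
~~~
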
280### What cannot be done on GPUs
281
282* Inputs/Outputs: The only possible direct output on GPUs is the screen (see
283 [output_ppm on GPUs](output.h)). All other inputs or outputs must go through
284 the CPU.
285* Complex memory allocation and access: There is no notion of "memory
286 stack" on GPUs, all memory requirements are ***static*** and must be
287 defined at compilation time. This means that variable/dynamical
288 arrays, dynamic memory allocation (malloc etc.) and ***pointers***
289 do not exist on GPUs (and in GLSL). Using any of these in a foreach
290 loop will give you a GLSL error as above.
291* Limited support for function pointers: function pointers are
292 fundamentally different from memory pointers. Basilisk includes
293 limited support for function pointers i.e. what is sufficient for
294 *field attributes* implementing custom boundary conditions or
295 coarsening/refinement functions.
296* Using external libraries: GPUs cannot (obviously) use functions defined in
297 external (CPU) libraries.
298
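As a minimal sketch (`s` and `a` are hypothetical fields), the following loop
cannot be compiled to GLSL since it uses `malloc()` and pointers; Basilisk
reports a GLSL error and falls back to running it on the CPU:

~~~literatec
foreach() {
  double * tmp = malloc (3*sizeof (double)); // pointers and malloc do not exist in GLSL
  for (int i = -1; i <= 1; i++)
    tmp[i + 1] = s[i];
  a[] = (tmp[0] + tmp[1] + tmp[2])/3.;
  free (tmp);
}
~~~
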
299## Current limitations
300
301Aside from the fundamental constraints above, the current
302implementation also has the following limitations, some of which will
303be lifted in the (not-too-distant) future. In rough order of "lifting
304priority" these are:
305
306* Only 2D Cartesian and Multigrid grids for now: 3D multigrid will
307 follow easily; quadtrees and octrees are more difficult.
308* The maximum size of any scalar field is limited to what can be
309 indexed using a 32-bit unsigned integer i.e. 2^32^ floats or 16 GB.
310* Boundary conditions have only been implemented for 3x3 stencils.
311* At this stage only a few solvers [have been
312 tested](/src/test/README#gpu-tests). Other solvers may or may not
313 work. In particular [surface tension](/src/tension.h) will not work
314 yet because the [estimation of curvature](/src/curvature.h) relies
315 on [code which is not portable to GPUs](/src/parabola.h).
316* Only simple external types (`int`, `bool`, `float`, `double`,
317 `coord` etc.) are currently supported: for example custom `typedefs`
318 are not supported yet for external variables. Loop-local variables
319 and functions can use (locally-defined) custom types.
320* Loop-local [lists](/Basilisk%20C#lists) of scalars have very limited
321 support (but are not used much anyway): support for external lists is
322 OK.
323* Double precision (64-bit floats) is supported by Basilisk (use
324 `CFLAGS='-DDOUBLE_PRECISION'`, see the example after this list) but
325 depends on the (often limited) support by graphics cards and their
326 drivers. Note also that using single precision can have a significant
327 impact on the convergence and accuracy of [multigrid solvers](/src/poisson.h).
328
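For example, to try double precision one could follow the Makefile recipe
above with something like

~~~bash
CFLAGS='-DDOUBLE_PRECISION' make code.gpu.tst
~~~

provided the graphics card and driver support 64-bit floats in shaders.
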
329## Performance
330
331To convince yourself that GPUs are worth the trouble, see the [GPU
332Benchmarks](Benchmarks.md): speedups of one to two orders of magnitude
333compared to CPUs are achievable.
334
335To maximize performance, here are a few tips and observations:
336
337* Make sure that you are using the correct graphics card and driver
338 (see [glxinfo](#installation) above).
339* GPUs are highly parallel, so they will only provide speedups for large
340 enough simulations (e.g. larger than 128^2^), increasingly so as
341 resolution is increased.
342* Frequent CPU/GPU memory synchronisation will kill performance. Be
343 careful, for example, to control how often you output data, much more
344 so than when running on CPUs. An exception is [graphical
345 outputs](output.h) which are much cheaper on GPUs and can be done
346 often with little overhead. Loops done on the CPU within
347 e.g. timestep iterations will generally kill performance.
348* Use [built-in profiling](/src/README.trace) to check where time is
349 spent. Use the `-DTRACE=3` compilation flag to get profiling
350 information at the level of foreach loops (see the example below).
351
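For example, following the make recipes above, per-loop profiling of the
turbulence example could be obtained with something like

~~~bash
CFLAGS='-DTRACE=3' make turbulence.gpu.tst
~~~
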
352## Bugs
353
354* Trying to use more than one shader storage buffer (SSBO) with Intel
355 drivers does not seem to work. This limits the amount of available
356 video memory to a single SSBO with a maximum size which is usually
357 2 GB or less (see below for how to check the limit reported by the driver).
358
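To check the maximum SSBO size reported by the driver, something like the
following (using the mesa-utils package installed above) may work

~~~bash
glxinfo -l | grep GL_MAX_SHADER_STORAGE_BLOCK_SIZE
~~~
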
359## See also
360
361* [Test cases on GPUs](/src/test/README#gpu-tests)
362* [GPU benchmarks](Benchmarks.md)
363* [Computation kernels](/src/ast/kernels.c)
364
365# Implementation */
366
367bool on_cpu = false;
368
369#include <khash.h>
370
371typedef struct {
372 int location, type, nd, local;
373 void * pointer;
374} MyUniform;
375
376typedef struct {
377 GLuint id, ng[2];
378 MyUniform * uniforms;
379} Shader;
380
381KHASH_MAP_INIT_INT(INT, Shader *)
382
383typedef struct {
384 GRIDPARENT parent;
385 GLuint reduct[2];
386 khash_t(INT) * shaders;
387} GridGPU;
388
389#define gpu_grid ((GridGPU *)grid)
390
391static char * str_append_array (char * dst, const char * list[])
392{
393 int empty = (dst == NULL);
394 int len = empty ? 0 : strlen (dst);
395 for (const char ** s = list; *s != NULL; s++)
396 len += strlen (*s);
397 dst = (char *) sysrealloc (dst, len + 1);
398 if (empty) dst[0] = '\0';
399 for (const char ** s = list; *s != NULL; s++)
400 strcat (dst, *s);
401 return dst;
402}
403
404#define str_append(dst, ...) str_append_array (dst, (const char *[]){__VA_ARGS__, NULL})
405#define xstr(a) str(a)
406#define str(a) #a
407
408static char glsl_preproc[] =
409 "// #line " xstr(LINENO) " " __FILE__ "\n"
410 "#define /* dim: x */\n"
411 "#define fmin(a,b) min(a,b)\n"
412 "#define fmax(a,b) max(a,b)\n"
413#if !SINGLE_PRECISION
414 "#define real double\n"
415 "#define coord dvec3\n"
416#else // !SINGLE_PRECISION
417 "#define real float\n"
418 "#define coord vec3\n"
419#endif // !SINGLE_PRECISION
420 "#define ivec ivec2\n"
421 "struct scalar { int i, index; };\n"
422#if dimension == 2
423#if SINGLE_PRECISION
424 "#define _coord vec2\n"
425#else
426 "#define _coord dvec2\n"
427 "#define cos(x) cos(float(x))\n"
428 "#define sin(x) sin(float(x))\n"
429 "#define exp(x) exp(float(x))\n"
430 "#define pow(x,y) pow(float(x), float(y))\n"
431#endif
432 "struct vector { scalar x, y; };\n"
433 "struct tensor { vector x, y; };\n"
434#endif // dimension == 2
435 "#define GHOSTS " xstr(GHOSTS) "\n"
436 "struct Point { int i, j, level;"
437#if MULTIGRID
438 " ivec2 n;"
439#else
440 " uint n;"
441#endif
442#if LAYERS
443 " int l;"
444#endif
445 "};\n"
446 "#define field_size() _field_size\n"
447 "#define ast_pointer(x) (x)\n"
448 GPU_CODE()
449 "#define _NVARMAX 65536\n"
450 "#define is_constant(v) ((v).i >= _NVARMAX)\n"
451 "#define NULL 0\n"
452 "#define val(s,k,l,m) ((s).i < _NVARMAX ? valt(s,k,l,m)"
453 " : _constant[clamp((s).i -_NVARMAX,0,_nconst-1)])\n"
454 "#define val_out_(s,i,j,k) valt(s,i,j,k)\n"
455 "#define diagonalize(a)\n"
456 "#define val_diagonal(s,i,j,k) real((i) == 0 && (j) == 0 && (k) == 0)\n"
457 "#define _attr(s,member) (_attr[(s).index].member)\n"
458 "#define forin(type,s,list) for (int _i = 0; _i < list.length() - 1; _i++) { type s = list[_i];\n"
459 "#define endforin() }\n"
460#if LAYERS
461 "#define _index(a,m) ((a).i + (point.l + _layer + (m) < _attr(a,block) ? point.l + _layer + (m) : 0))\n"
462#else
463 "#define _index(a,m) ((a).i)\n"
464#endif
465 "#define forin2(a,b,c,d) for (int _i = 0; _i < c.length() - 1; _i++)"
466 " { a = c[_i]; b = d[_i];\n"
467 "#define endforin2() }\n"
468 "#define forin3(a,b,e,c,d,f) for (int _i = 0; _i < c.length() - 1; _i++)"
469 " { a = c[_i]; b = d[_i]; e = f[_i];\n"
470 "#define endforin3() }\n"
471 "#define NOT_UNUSED(x)\n"
472 "#define pi 3.14159265359\n"
473 "#define nodata (1e30)\n"
474 "#define fabs(x) abs(x)\n"
475 "#define neighborp(_i,_j,_k) Point(point.i+_i,point.j+_j,point.level,point.n"
476#if LAYERS
477 ",point.l"
478#endif
479 ")\n"
480 "const real z = 0.;\n"
481 "const int ig = 0, jg = 0;\n"
482 "layout (location = 0) uniform ivec2 csOrigin = ivec2(0,0);\n"
483 "layout (location = 1) uniform vec2 vsOrigin = vec2(0.,0.);\n"
484 "layout (location = 2) uniform vec2 vsScale = vec2(1.,1.);\n"
485 ;
486
487static inline int list_size (const External * i)
488{
489 int size = 0;
490 if (i->type == sym_SCALAR) {
491 size = 1;
492 if (i->nd == 1)
493 for (scalar s in i->pointer) size++;
494 }
495 else if (i->type == sym_VECTOR) {
496 size = 1;
497 if (i->nd == 1)
498 for (vector v in i->pointer) size++;
499 }
500 else if (i->type == sym_TENSOR) {
501 size = 1;
502 if (i->nd == 1)
503 for (tensor t in i->pointer) size++;
504 }
505 else
506 return 0;
507 return size;
508}
509
510static char * write_scalar (char * fs, scalar s)
511{
512 char i[20], index[20];
513 snprintf (i, 19, "%d", s.i);
514 if (s.i < 0 || is_constant(s))
515 strcpy (index, "0");
516 else {
517 if (s.block < 0)
518 s.i += s.block;
519 snprintf (index, 19, "%d", s.gpu.index - 1);
520 }
521 return str_append (fs, "{", i, ",", index, "}");
522}
523
524static char * write_vector (char * fs, vector v)
525{
526 fs = str_append (fs, "{");
527 fs = write_scalar (fs, v.x);
528 fs = str_append (fs, ",");
529 fs = write_scalar (fs, v.y);
530 fs = str_append (fs, "}");
531 return fs;
532}
533
534static char * write_tensor (char * fs, tensor t)
535{
536 fs = str_append (fs, "{");
537 fs = write_vector (fs, t.x);
538 fs = str_append (fs, ",");
539 fs = write_vector (fs, t.y);
540 fs = str_append (fs, "}");
541 return fs;
542}
543
544static scalar * apply_bc_list;
545
546static int bc_period_x = -1, bc_period_y = -1;
547
548static void boundary_top (Point point, int i)
549{
550 bool data = false;
551 for (scalar s in apply_bc_list)
552 if (!s.face || s.i != s.v.y.i) {
553 scalar b = (s.v.x.i < 0 ? s : s.i == s.v.y.i ? s.v.x : s.v.y);
554 /* foreach_blockf */
555 s[i,-bc_period_y] = b.boundary_top (neighborp(i), neighborp(i,-bc_period_y), s, &data);
556 }
557}
558
559static void boundary_bottom (Point point, int i)
560{
561 bool data = false;
562 for (scalar s in apply_bc_list)
563 if (!s.face || s.i != s.v.y.i) {
564 scalar b = (s.v.x.i < 0 ? s : s.i == s.v.y.i ? s.v.x : s.v.y);
565 /* foreach_blockf */
566 s[i,bc_period_y] = b.boundary_bottom (neighborp(i), neighborp(i,bc_period_y), s, &data);
567 }
568}
569
570static
571void apply_bc (Point point)
572{
573 bool data = false;
574 // face BCs
575 if (point.i == GHOSTS)
576 for (scalar s in apply_bc_list)
577 if (s.face && s.i == s.v.x.i && s.boundary_left)
578 /* foreach_blockf */
579 s[] = s.boundary_left (point, neighborp(bc_period_x), s, &data);
580 if (point.i == N*Dimensions.x + GHOSTS)
581 for (scalar s in apply_bc_list)
582 if (s.face && s.i == s.v.x.i && s.boundary_right)
583 /* foreach_blockf */
584 s[] = s.boundary_right (neighborp(bc_period_x), point, s, &data);
585 if (point.j == GHOSTS)
586 for (scalar s in apply_bc_list)
587 if (s.face && s.i == s.v.y.i) {
588 scalar b = s.v.x;
589 if (b.boundary_bottom)
590 /* foreach_blockf */
591 s[] = b.boundary_bottom (point, neighborp(0,bc_period_y), s, &data);
592 }
593 if (point.j == N*Dimensions.y + GHOSTS)
594 for (scalar s in apply_bc_list)
595 if (s.face && s.i == s.v.y.i) {
596 scalar b = s.v.x;
597 if (b.boundary_top)
598 /* foreach_blockf */
599 s[] = b.boundary_top (neighborp(0,bc_period_y), point, s, &data);
600 }
601 // centered BCs
602 if (point.i == GHOSTS) { // left
603 for (scalar s in apply_bc_list)
604 if (!s.face || s.i != s.v.x.i)
605 /* foreach_blockf */
606 s[bc_period_x] = s.boundary_left (point, neighborp(bc_period_x), s, &data);
607 if (point.j == GHOSTS)
608 boundary_bottom (point, bc_period_x); // bottom-left
609 if (point.j == N*Dimensions.y + GHOSTS - 1)
610 boundary_top (point, bc_period_x); // top-left
611 }
612 if (point.i == N*Dimensions.x + GHOSTS - 1) { // right
613 for (scalar s in apply_bc_list)
614 if (!s.face || s.i != s.v.x.i)
615 /* foreach_blockf */
616 s[- bc_period_x] = s.boundary_right (point, neighborp(- bc_period_x), s, &data);
617 if (point.j == GHOSTS)
618 boundary_bottom (point, - bc_period_x); // bottom-right
619 if (point.j == N*Dimensions.y + GHOSTS - 1)
620 boundary_top (point, - bc_period_x); // top-right
621 }
622 if (point.j == GHOSTS)
623 boundary_bottom (point, 0); // bottom
624 if (point.j == N*Dimensions.y + GHOSTS - 1)
625 boundary_top (point, 0); // top
626}
627
628static bool is_boundary_attribute (const External * g)
629{
630 return (g->name[0] == '.' &&
631 (!strcmp (g->name, ".boundary_left") ||
632 !strcmp (g->name, ".boundary_right") ||
633 !strcmp (g->name, ".boundary_bottom") ||
634 !strcmp (g->name, ".boundary_top")));
635}
636
637// fixme: for the moment only 'const int' are considered, this could be generalised
638#define IS_EXTERNAL_CONSTANT(g) ((g)->constant && (g)->type == sym_INT && !(g)->data)
639
640static
641void hash_external (Adler32Hash * hash, const External * g, const ForeachData * loop, int indent)
642{
643 if (g->type == sym_function_declaration || g->type == sym_function_definition) {
644 bool boundary = is_boundary_attribute (g);
645 for (scalar s in baseblock)
646 if (g->name[0] != '.' || (!boundary && s.gpu.index) ||
647 (boundary && (s.stencil.io & s_output))) {
648 void * ptr = g->name[0] != '.' ? g->pointer :
649 *((void **)(((char *) &_attribute[s.i]) + g->nd));
650 if (ptr) {
651 External * p = _get_function ((long) ptr);
652 if (p && !p->used) {
653 p->used = true;
654 a32_hash_add (hash, &ptr, sizeof (void *));
655 for (const External * e = p->externals; e && e->name; e++)
656 hash_external (hash, e, loop, indent + 2);
657 }
658 }
659 if (g->name[0] != '.')
660 break;
661 }
662 }
663 else if (g->name[0] == '.') {
664 int size = 0;
665 switch (g->type) {
666 case sym_INT: size = sizeof (int); break;
667 case sym_BOOL: size = sizeof (bool); break;
668 case sym_FLOAT: size = sizeof (float); break;
669 case sym_DOUBLE: size = sizeof (double); break;
670 case sym_IVEC: size = sizeof (ivec); break;
671 case sym_SCALAR: size = sizeof (scalar); break;
672 case sym_VECTOR: size = sizeof (vector); break;
673 case sym_TENSOR: size = sizeof (tensor); break;
674 default: return;
675 }
676 for (scalar s in baseblock)
677 if (s.gpu.index) {
678 char * data = (char *) &_attribute[s.i];
679 a32_hash_add (hash, data + g->nd, size);
680 }
681 }
682 else if (g->type == sym_SCALAR || g->type == sym_VECTOR || g->type == sym_TENSOR) {
683 assert (g->nd == 0 || g->nd == 1);
684 void * pointer = g->pointer;
685 int size;
686 if (g->nd == 1) {
687 size = 0;
688 for (scalar s in pointer)
689 size += sizeof (scalar);
690 }
691 else if (g->type == sym_SCALAR)
692 size = sizeof (scalar);
693 else if (g->type == sym_VECTOR)
694 size = sizeof (vector);
695 else // sym_TENSOR
696 size = sizeof (tensor);
697 a32_hash_add (hash, pointer, size);
698 }
699 else if (IS_EXTERNAL_CONSTANT(g))
700 a32_hash_add (hash, g->pointer, sizeof(int));
701}
702
703static
704uint32_t hash_shader (const External * externals,
705 const ForeachData * loop,
706 const RegionParameters * region, const char * kernel)
707{
708 Adler32Hash hash;
709 a32_hash_init (&hash);
710 a32_hash_add (&hash, &region->level, sizeof (region->level));
711 a32_hash_add (&hash, &N, sizeof (N));
712#if LAYERS
713 a32_hash_add (&hash, &nl, sizeof (nl));
714#endif
715 a32_hash_add (&hash, &Dimensions, sizeof (Dimensions));
716 a32_hash_add (&hash, &Period, sizeof (Period));
717 a32_hash_add (&hash, kernel, strlen (kernel));
718 a32_hash_add (&hash, &nconst, sizeof (nconst));
719 a32_hash_add (&hash, _constant, sizeof (*_constant)*nconst);
720 static External ext[] = {
721 { .name = ".boundary_left", .type = sym_function_declaration,
722 .nd = attroffset (boundary_left) },
723 { .name = ".boundary_right", .type = sym_function_declaration,
724 .nd = attroffset (boundary_right) },
725 { .name = ".boundary_top", .type = sym_function_declaration,
726 .nd = attroffset (boundary_top) },
727 { .name = ".boundary_bottom", .type = sym_function_declaration,
728 .nd = attroffset (boundary_bottom) },
729#if LAYERS
730 { .name = ".block", .type = sym_INT, .nd = attroffset (block) },
731#endif
732 { .name = NULL }
733 };
734 foreach_function (f, f->used = false);
735 for (const External * g = loop->dirty ? ext : ext + 4; g->name; g++)
736 hash_external (&hash, g, loop, 2);
737 int imax = 1;
738 if (GPUContext.nssbo > 1)
739 for (scalar s in baseblock)
740 if (s.gpu.index && s.i + s.block > imax)
741 imax = s.i + s.block;
742 for (const External * g = externals; g && g->name; g++) {
743 if (g->reduct) {
744 a32_hash_add (&hash, &g->s, sizeof (scalar));
745 if (GPUContext.nssbo > 1) {
746 scalar s = g->s;
747 if (s.i + s.block > imax)
748 imax = s.i + s.block;
749 }
750 }
751 hash_external (&hash, g, loop, 2);
752 }
753 a32_hash_add (&hash, &imax, sizeof (imax));
754 return a32_hash (&hash);
755}
756
757static
758bool is_void_function (char * code)
759{
760 while (*code != '/') code++; code++;
761 while (*code != '/') code++; code++;
762 while (*code != '\n') code++; code++;
763 while (strchr ("\n\r\t ", *code)) code++;
764 return !strncmp (code, "void ", 5);
765}
766
767static char * type_string (const External * g)
768{
769 switch (g->type) {
770 case sym_DOUBLE:
771#if !SINGLE_PRECISION
772 return "double"; break;
773#endif
774 case sym_FLOAT: return "float";
775 case sym_function_declaration: case sym_enumeration_constant:
776 case sym_INT: case sym_LONG: return "int";
777 case sym_BOOL: return "bool";
778 case sym_SCALAR: return "scalar";
779 case sym_VECTOR: return "vector";
780 case sym_TENSOR: return "tensor";
781 case sym_COORD: return "coord";
782 case sym__COORD: return "_coord";
783 case sym_VEC4: return "vec4";
784 case sym_IVEC: return "ivec";
785 }
786 return "unknown_type";
787}
788
789#define EXTERNAL_NAME(g) (g)->global == 2 ? "_loc_" : "", (g)->name, (g)->reduct ? "_in_" : ""
790
791trace
792char * build_shader (External * externals, const ForeachData * loop,
793 const RegionParameters * region, const GLuint nwg[2])
794{
795 int Nl = region->level > 0 ? 1 << (region->level - 1) : N/Dimensions.x;
796 char s[30];
797 snprintf (s, 19, "%d", nconst > 0 ? nconst : 1);
798 char a[20];
799 snprintf (a, 19, "%g", nconst > 0 ? _constant[0] : 0);
800 char * fs = str_append (NULL, "#version 430\n", glsl_preproc,
801 "const int _nconst = ", s, ";\n"
802 "const real _constant[_nconst] = {", a);
803 for (int i = 1; i < nconst; i++) {
804 snprintf (a, 19, "%g", _constant[i]);
805 fs = str_append (fs, ",", a);
806 }
807 fs = str_append (fs, "};\n"
808 "layout(std430, binding = 0)"
809 " restrict buffer _data_layout { real f[]; } _data");
810 if (GPUContext.nssbo > 1) {
811 snprintf (a, 19, "%d", GPUContext.nssbo);
812 snprintf (s, 29, "%ld", GPUContext.max_ssbo_size/sizeof(real));
813 fs = str_append (fs, "[", a, "];\n"
814 "#define _data_val(field,index) _data[_offset[(field)].i+(_offset[(field)].j+(index))/", s,
815 "].f[(_offset[(field)].j+(index))%", s, "]\n");
816 }
817 else
818 fs = str_append (fs, ";\n"
819 "#define _data_val(field,index) _data.f[(field)*field_size() + (index)]\n");
820
821 /**
822 Scalar field attributes */
823
824 char * attributes = NULL;
825 for (const External * g = externals; g; g = g->next)
826 if (g->name[0] == '.') {
827 attributes = str_append (attributes, " ", type_string (g), " ", g->name + 1);
828 for (int * d = g->data; d && *d > 0; d++) {
829 char s[20]; snprintf (s, 19, "%d", *d);
830 attributes = str_append (attributes, "[", s, "]");
831 }
832 attributes = str_append (attributes, ";\n");
833 }
834
835 if (attributes) {
836 fs = str_append (fs, "struct _Attributes {\n", attributes, "};\n");
837 sysfree (attributes);
838 int nindex = 0;
839 for (scalar s in baseblock)
840 if (s.gpu.index)
841 nindex++;
842 assert (nindex > 0);
843 char s[20]; snprintf (s, 19, "%d", nindex);
844 fs = str_append (fs, "const _Attributes _attr[", s, "]={");
845 nindex = 0;
846 for (scalar s in baseblock)
847 if (s.gpu.index) {
848 fs = str_append (fs, nindex ? ",{" : "{"); nindex++;
849 bool first = true;
850 char * data = (char *) &_attribute[s.i];
851 for (const External * g = externals; g; g = g->next)
852 if (g->name[0] == '.') {
853 if (!first) fs = str_append (fs, ",");
854 first = false;
855 if (g->type == sym_INT) {
856 char s[20]; snprintf (s, 19, "%d", *((int *)(data + g->nd)));
857 fs = str_append (fs, s);
858 }
859 else if (g->type == sym_BOOL)
860 fs = str_append (fs, *((bool *)(data + g->nd)) ? "true" : "false");
861 else if (g->type == sym_FLOAT) {
862 char s[20]; snprintf (s, 19, "%g", *((float *)(data + g->nd)));
863 fs = str_append (fs, s);
864 }
865 else if (g->type == sym_DOUBLE) {
866 char s[20]; snprintf (s, 19, "%g", *((double *)(data + g->nd)));
867 fs = str_append (fs, s);
868 }
869 else if (g->type == sym_IVEC) {
870 ivec * v = (ivec *)(data + g->nd);
871 char s[20]; snprintf (s, 19, "{%d,%d}", v->x, v->y);
872 fs = str_append (fs, s);
873 }
874 else if (g->type == sym_function_declaration) {
875 void * func = *((void **)(data + g->nd));
876 if (!func)
877 fs = str_append (fs, "0");
878 else {
879 External * ptr = _get_function ((long) func);
880 char s[20]; snprintf (s, 19, "%d", ptr->nd);
881 fs = str_append (fs, s);
882 }
883 }
884 else if (g->type == sym_SCALAR)
885 fs = write_scalar (fs, *((scalar *)(data + g->nd)));
886 else if (g->type == sym_VECTOR)
887 fs = write_vector (fs, *((vector *)(data + g->nd)));
888 else if (g->type == sym_TENSOR)
889 fs = write_tensor (fs, *((tensor *)(data + g->nd)));
890 else
891 fs = str_append (fs, "not implemented");
892 }
893 fs = str_append (fs, "}");
894 }
895 fs = str_append (fs, "};\n");
896 }
897
898 /**
899 Field offsets when using multiple SSBOs */
900
901 if (GPUContext.nssbo > 1) {
902 int imax = 1;
903 for (scalar s in baseblock)
904 if (s.gpu.index && s.i + s.block > imax)
905 imax = s.i + s.block;
906 for (External * g = externals; g; g = g->next)
907 if (g->reduct) {
908 scalar s = g->s;
909 if (s.i + s.block > imax)
910 imax = s.i + s.block;
911 }
912 char string[100]; snprintf (string, 19, "%d", imax);
913 fs = str_append (fs, "const struct { uint i, j; } _offset[", string, "]={");
914 for (int s = 0; s < imax; s++) {
915 fs = str_append (fs, s ? ",{" : "{");
916 size_t offset = s*field_size(), size = GPUContext.max_ssbo_size/sizeof(real);
917 size_t i = offset/size, j = offset%size;
918 snprintf (string, 99, "%ld,%ld", i, j);
919 fs = str_append (fs, string, "}");
920
921 if (j + field_size() > 1L << 32) {
922 fprintf (stderr, "%s:%d: error: resolution is too high for 32-bits indexing\n",
923 __FILE__, LINENO);
924 exit (1);
925 }
926 }
927 fs = str_append (fs, "};\n");
928 }
929
930 /**
931 Non-local variables */
932
933 for (External * g = externals; g; g = g->next) {
934 if (g->name[0] == '.') {
935 if (g->type == sym_function_declaration) {
936 bool boundary = is_boundary_attribute (g);
937 fs = str_append (fs, "#define _attr_", g->name + 1, "(s,args) (");
938 foreach_function (f, f->used = false);
939 char * expr = NULL;
940 for (scalar s in baseblock)
941 if (s.gpu.index) {
942 char * data = (char *) &_attribute[s.i];
943 void * func = *((void **)(data + g->nd));
944 if (func) {
945 External * f = _get_function ((long) func);
946 if (!f->used && (!boundary || (s.stencil.io & s_output))) {
947 f->used = true;
948 char * args = is_void_function (f->data) ? " args,0" : " args";
949 if (!expr)
950 expr = str_append (NULL, "(", f->name, args, ")");
951 else {
952 char index[20];
953 snprintf (index, 19, "%d", f->nd);
954 char * s = str_append (NULL, "_attr(s,", g->name + 1, ")==", index,
955 "?(", f->name, args, "):", expr);
956 sysfree (expr);
957 expr = s;
958 }
959 }
960 }
961 }
962 if (expr) {
963 fs = str_append (fs, expr, ")\n");
964 sysfree (expr);
965 }
966 else
967 fs = str_append (fs, "0)\n");
968 }
969 }
970 else if (g->type == sym_function_definition) {
971 External * f = _get_function ((long) g->pointer);
972 fs = str_append (fs, "\n", f->data, "\n");
973 char s[20]; snprintf (s, 19, "%d", f->nd);
974 fs = str_append (fs, "const int _p", g->name, " = ", s, ";\n");
975 }
976 else if (g->type == sym_function_declaration) {
977 External * f = _get_function ((long) g->pointer);
978 char s[20]; snprintf (s, 19, "%d", f->nd);
979 fs = str_append (fs, "const int ", g->name, " = ", s, ";\n",
980 "#define _f", g->name, "(args) (", f->name, " args)\n");
981 }
982 else if (g->type != sym_SCALAR &&
983 g->type != sym_VECTOR &&
984 g->type != sym_TENSOR) {
985 if (g->type == sym_INT && !strcmp (g->name, "N")) {
986 int level = region->level > 0 ? region->level - 1 : depth();
987 char s[20], d[20], l[20];
988 snprintf (s, 19, "%d", Nl);
989 snprintf (d, 19, "%d", depth());
990 snprintf (l, 19, "%d", level);
991 fs = str_append (fs,
992 "const uint N = ", s, ", _depth = ", d, ";\n");
993 if (GPUContext.nssbo == 1) {
994 char size[30];
995 snprintf (size, 29, "%ld", (size_t) field_size());
996 fs = str_append (fs,
997 "const uint _field_size = ", size, ";\n");
998 }
999#ifdef shift_level // multigrid
1000 fs = str_append (fs,
1001 "const uint _shift[_depth + 1] = {");
1002 for (int d = 0; d <= depth(); d++) {
1003 snprintf (s, 19, "%ld", shift_level(d));
1004 fs = str_append (fs, d > 0 ? "," : "", s);
1005 }
1006 fs = str_append (fs, "};\n");
1007#endif // ifdef shift_level
1008 snprintf (s, 19, "{%d,%d}", Dimensions.x, Dimensions.y);
1009 fs = str_append (fs,
1010 "const ivec Dimensions = ", s, ";\n"
1011 "const uint NY = ", loop->face > 1 || loop->vertex ?
1012 "N*Dimensions.y + 1" : "N*Dimensions.y", ";\n");
1013 if (GPUContext.fragment_shader)
1014 fs = str_append (fs, "in vec2 vsPoint;\n"
1015 "Point point = {int((vsPoint.x*vsScale.x + vsOrigin.x)*N*Dimensions.x)"
1016 " + GHOSTS,"
1017 "int((vsPoint.y*vsScale.y + vsOrigin.y)*N*Dimensions.x) + GHOSTS,", l,
1018#if MULTIGRID
1019 ",ivec2(N*Dimensions.x,N*Dimensions.y)"
1020#else
1021 ",N"
1022#endif
1023#if LAYERS
1024 ",0"
1025#endif
1026 "};\n"
1027 "out vec4 FragColor;\n");
1028 else {
1029 char nwgx[20], nwgy[20];
1030 snprintf (nwgx, 19, "%d", nwg[0]);
1031 snprintf (nwgy, 19, "%d", nwg[1]);
1032 fs = str_append (fs, "layout (local_size_x = ", nwgx,
1033 ", local_size_y = ", nwgy, ") in;\n");
1034 }
1035 }
1036 else if (g->type == sym_INT && !strcmp (g->name, "nl")) {
1037
1038 /**
1039 'int nl' gets special treatment. */
1040
1041 char nl[20];
1042 snprintf (nl, 19, "%d", *((int *)g->pointer));
1043 fs = str_append (fs, "const int nl = ", nl, ";\n");
1044 }
1045 else if (g->type == sym_INT && !strcmp (g->name, "bc_period_x"))
1046 fs = str_append (fs, "const int bc_period_x = ", Period.x ?
1047 "int(N*Dimensions.x)" : "-1", ";\n");
1048 else if (g->type == sym_INT && !strcmp (g->name, "bc_period_y"))
1049 fs = str_append (fs, "const int bc_period_y = ", Period.y ?
1050 "int(N*Dimensions.y)" : "-1", ";\n");
1051 else if (GPUContext.fragment_shader && (region->n.x > 1 || region->n.y > 1) &&
1052 g->type == sym_COORD && !strcmp (g->name, "p")) {
1053
1054 /**
1055 'coord p' is assumed to be the parameter of a region. This is
1056 not flexible (the parameter must be called 'p') and should be improved. */
1057
1058 fs = str_append (fs, "coord p = vec3((vsPoint*vsScale + vsOrigin)*L0 + vec2(X0, Y0),0);\n");
1059 }
1060 else if (IS_EXTERNAL_CONSTANT(g)) {
1061 // fixme: for the moment only 'const int' are considered, this could be generalised
1062 char value[20];
1063 assert (g->pointer);
1064 snprintf (value, 19, "%d", *((int *)g->pointer));
1065 fs = str_append (fs, "const ", type_string (g), " ", EXTERNAL_NAME (g), "=", value, ";\n");
1066 }
1067 else if (strcmp (g->name, "Dimensions")) {
1068 char * type = type_string (g);
1069 fs = str_append (fs, "uniform ", type, " ", EXTERNAL_NAME (g));
1070 for (int * d = g->data; d && *d > 0; d++) {
1071 char s[20]; snprintf (s, 19, "%d", *d);
1072 fs = str_append (fs, "[", s, "]");
1073 }
1074 fs = str_append (fs, ";\n");
1075 if (g->reduct) {
1076 fs = str_append (fs, type, " ", g->global == 2 ? "_loc_" : "", g->name, " = ",
1077 EXTERNAL_NAME (g), ";\n");
1078 fs = str_append (fs, "const scalar ", g->name, "_out_ = ");
1079 fs = write_scalar (fs, g->s);
1080 fs = str_append (fs, ";\n");
1081 }
1082 }
1083 }
1084 else { // scalar, vector and tensor fields
1085 int size = list_size (g);
1086 for (int j = 0; j < size; j++) {
1087 if (j == 0) {
1088 fs = str_append (fs, "const ", type_string (g), " ", EXTERNAL_NAME (g));
1089 if (g->nd == 0)
1090 fs = str_append (fs, " = ");
1091 else {
1092 char s[20]; snprintf (s, 19, "%d", size);
1093 fs = str_append (fs, "[", s, "] = {");
1094 }
1095 }
1096 if (g->nd == 0 || j < size - 1) {
1097 if (g->type == sym_SCALAR)
1098 fs = write_scalar (fs, ((scalar *)g->pointer)[j]);
1099 else if (g->type == sym_VECTOR)
1100 fs = write_vector (fs, ((vector *)g->pointer)[j]);
1101 else if (g->type == sym_TENSOR)
1102 fs = write_tensor (fs, ((tensor *)g->pointer)[j]);
1103 else
1104 assert (false);
1105 }
1106 else { // last element of a list is always ignored (this is necessary for empty lists)
1107 if (g->type == sym_SCALAR)
1108 fs = str_append (fs, "{0,0}");
1109 else if (g->type == sym_VECTOR)
1110 fs = str_append (fs, "{{0,0},{0,0}}");
1111 else if (g->type == sym_TENSOR)
1112 fs = str_append (fs, "{{{0,0},{0,0}},{{0,0},{0,0}}}");
1113 else
1114 assert (false);
1115 }
1116 if (g->nd == 0)
1117 fs = str_append (fs, ";\n");
1118 else if (j < size - 1)
1119 fs = str_append (fs, ",");
1120 else
1121 fs = str_append (fs, "};\n");
1122 }
1123 }
1124 }
1125
1126 return fs;
1127}
1128
1129trace
1130Shader * load_shader (const char * fs, uint32_t hash, const ForeachData * loop)
1131{
1132 assert (gpu_grid->shaders);
1133 khiter_t k = kh_get (INT, gpu_grid->shaders, hash);
1134 if (k != kh_end (gpu_grid->shaders)) {
1135 sysfree ((void *)fs);
1136 return kh_value (gpu_grid->shaders, k);
1137 }
1138#if PRINTSHADER
1139 {
1140 static int n = 1;
1141 fprintf (stderr, "=================== %s:%d: shader #%d ===================\n",
1142 loop ? loop->fname : "reduction", loop ? loop->line : 0, n++);
1143 fputs (fs, stderr);
1144 }
1145#endif
1146 GLuint id;
1147 if (!GPUContext.fragment_shader)
1148 id = loadNormalShader (NULL, fs);
1149 else {
1150 const char quad[] =
1151 "#version 430\n"
1152 "layout(location = 0) in vec3 vsPos;"
1153 "out vec2 vsPoint;"
1154 "void main() {"
1155 " vsPoint = vsPos.xy;"
1156 " gl_Position = vec4(2.*vsPos.xy - vec2(1.), 0., 1.);"
1157 "}";
1158 id = loadNormalShader (quad, fs);
1159 }
1160 Shader * shader = NULL;
1161 if (id) {
1162 shader = calloc (1, sizeof (Shader));
1163 shader->id = id;
1164 int ret;
1165 khiter_t k = kh_put (INT, gpu_grid->shaders, hash, &ret);
1166 assert (ret > 0);
1167 kh_value (gpu_grid->shaders, k) = shader;
1168 }
1169 sysfree ((void *)fs);
1170 return shader;
1171}
1172
1173void gpu_limits (FILE * fp)
1174{
1175 GLString * i = gpu_limits_list;
1176 while (i->s) {
1177 GLint val;
1178 GL_C (glGetIntegerv (i->index, &val));
1179 fprintf (fp, "%s: %d\n", i->s, val);
1180 i++;
1181 }
1182}
1183
1184void gpu_free()
1185{
1186 if (!grid)
1187 return;
1188 free_boundaries();
1189 Shader * shader;
1190 int nshaders = 0;
1191 kh_foreach_value (gpu_grid->shaders, shader,
1192 free (shader->uniforms);
1193 free (shader);
1194 nshaders++; );
1195#if PRINTNSHADERS
1196 fprintf (stderr, "# %d shaders\n", nshaders);
1197#endif
1198 kh_destroy (INT, gpu_grid->shaders);
1199 gpu_grid->shaders = NULL;
1200 if (gpu_grid->reduct[0]) {
1201 GL_C (glDeleteBuffers (2, gpu_grid->reduct));
1202 for (int i = 0; i < 2; i++)
1203 gpu_grid->reduct[i] = 0;
1204 }
1205 if (GPUContext.nssbo) {
1206 GL_C (glDeleteBuffers (GPUContext.nssbo, GPUContext.ssbo));
1207 free (GPUContext.ssbo);
1208 GPUContext.ssbo = NULL;
1209 GPUContext.nssbo = 0;
1210 }
1211 GPUContext.current_size = 0;
1212}
1213
1214void gpu_init()
1215{
1216 /* GPU drivers often generate floating-point exceptions... turn them off */
1217 disable_fpe (FE_DIVBYZERO|FE_INVALID);
1218 if (!GPUContext.window) {
1219 if (!glfwInit ())
1220 exit (1);
1221
1222 glfwWindowHint (GLFW_VISIBLE, GL_FALSE);
1223 glfwWindowHint (GLFW_RESIZABLE, GL_FALSE);
1224 glfwWindowHint (GLFW_CONTEXT_VERSION_MAJOR, 3);
1225 glfwWindowHint (GLFW_CONTEXT_VERSION_MINOR, 3);
1226 glfwWindowHint (GLFW_SAMPLES, 0);
1227 glfwWindowHint (GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
1228 glfwWindowHint (GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);
1229#if DEBUG_OPENGL
1230 glfwWindowHint (GLFW_OPENGL_DEBUG_CONTEXT, GL_TRUE);
1231#endif
1232
1233 GPUContext.window = glfwCreateWindow (1, 1, "GPU grid", NULL, NULL);
1234 if (!GPUContext.window) {
1235 glfwTerminate();
1236 fprintf (stderr, "GLFW: error: could not create window!\n");
1237 exit (1);
1238 }
1239 glfwMakeContextCurrent (GPUContext.window);
1240
1241 // load GLAD.
1242 assert (gladLoadGLLoader ((GLADloadproc)glfwGetProcAddress));
1243 assert (glBindImageTexture);
1244
1245#if DEBUG_OPENGL
1246 GLint flags;
1247 GL_C (glGetIntegerv (GL_CONTEXT_FLAGS, &flags));
1248 if (flags & GL_CONTEXT_FLAG_DEBUG_BIT) {
1249 GL_C (glEnable(GL_DEBUG_OUTPUT));
1250 GL_C (glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS));
1251 GL_C (glDebugMessageCallback (GLDebugMessageCallback, NULL));
1252 GL_C (glDebugMessageControl (GL_DONT_CARE, GL_DONT_CARE, GL_DONT_CARE, 0, NULL, GL_TRUE));
1253 }
1254#endif // DEBUG_OPENGL
1255
1256 free_solver_func_add (gpu_free);
1257 free_solver_func_add (gpu_free_solver);
1258 }
1259 gpu_grid->shaders = kh_init (INT);
1260 for (int i = 0; i < 2; i++)
1261 gpu_grid->reduct[i] = 0;
1262
1263 realloc_ssbo();
1264}
1265
1266void gpu_free_grid (void)
1267{
1268 if (!grid)
1269 return;
1270 gpu_free();
1271 free_grid();
1272}
1273
1274attribute {
1275 double (* boundary_left) (Point, Point, scalar, bool *);
1276 double (* boundary_right) (Point, Point, scalar, bool *);
1277 double (* boundary_top) (Point, Point, scalar, bool *);
1278 double (* boundary_bottom) (Point, Point, scalar, bool *);
1279}
1280
1281/**
1282The `stored` attribute tracks where the up-to-date field is stored:
1283
1284* 0: on both the CPU and GPU (i.e. synchronized).
1285* 1: on the CPU.
1286* -1: on the GPU.
1287*/
1288
1289attribute {
1290 struct {
1291 int stored, index;
1292 } gpu;
1293}
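
/* How this convention is read (an illustration only, not used by the code
   below): s.gpu.stored < 0 means the up-to-date copy is only on the GPU,
   s.gpu.stored > 0 means it is only on the CPU and s.gpu.stored == 0 means
   the two copies are synchronized (see gpu_cpu_sync_scalar() below). */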
1294
1295trace
1296static void gpu_cpu_sync_scalar (scalar s, char * sep, GLenum mode)
1297{
1298 assert ((mode == GL_MAP_READ_BIT && s.gpu.stored < 0) ||
1299 (mode == GL_MAP_WRITE_BIT && s.gpu.stored > 0));
1300 if (s.gpu.stored > 0 && !(s.stencil.bc & s_centered))
1301 boundary ({s});
1302 GL_C (glMemoryBarrier (GL_BUFFER_UPDATE_BARRIER_BIT));
1303 size_t size = (size_t)field_size()*sizeof(real), offset = s.i*size, totalsize = s.block*size;
1304 char * cd = grid_data() + offset;
1305 int index = offset/GPUContext.max_ssbo_size;
1306 offset -= index*GPUContext.max_ssbo_size;
1307 while (totalsize) {
1308 GL_C (glBindBuffer (GL_SHADER_STORAGE_BUFFER, GPUContext.ssbo[index]));
1309 size_t size = min (totalsize, GPUContext.max_ssbo_size - offset);
1310
1311 // fprintf (stderr, "map %d %ld %ld\n", index, offset, size);
1312
1313 char * gd = glMapBufferRange (GL_SHADER_STORAGE_BUFFER, offset, size, mode);
1314 assert (gd);
1315 if (mode == GL_MAP_READ_BIT)
1316 memcpy (cd, gd, size);
1317 else if (mode == GL_MAP_WRITE_BIT)
1318 memcpy (gd, cd, size);
1319 else
1320 assert (false);
1321 assert (glUnmapBuffer (GL_SHADER_STORAGE_BUFFER));
1322 cd += size, totalsize -= size, offset = 0, index++;
1323 }
1324 GL_C (glBindBuffer (GL_SHADER_STORAGE_BUFFER, 0));
1325 if (sep)
1326 fprintf (stderr, "%s%s", sep, s.name);
1327 s.gpu.stored = 0;
1328}
1329
1330static void gpu_cpu_sync (scalar * list, GLenum mode, const char * fname, int line)
1331{
1332#if PRINTCOPYGPU
1333 bool copy = false;
1334#endif
1335 for (scalar s in list)
1336 if (((s.stencil.io & s_input) || (s.stencil.io & s_output)) &&
1337 ((mode == GL_MAP_READ_BIT && s.gpu.stored < 0) ||
1338 (mode == GL_MAP_WRITE_BIT && s.gpu.stored > 0))) {
1339#if PRINTCOPYGPU
1340 if (!copy) {
1341 fprintf (stderr, "%s:%d: %s ", fname, line,
1342 mode == GL_MAP_READ_BIT ? "importing" : "exporting");
1343 copy = true;
1344 gpu_cpu_sync_scalar (s, "{", mode);
1345 }
1346 else
1347 gpu_cpu_sync_scalar (s, ",", mode);
1348#else
1349 gpu_cpu_sync_scalar (s, NULL, mode);
1350#endif
1351 }
1352#if PRINTCOPYGPU
1353 if (copy)
1354 fprintf (stderr, "} %s GPU\n", mode == GL_MAP_READ_BIT ? "from" : "to");
1355#endif
1356}
1357
1358trace
1359void reset_gpu (void * alist, double val)
1360{
1361 size_t size = (size_t)field_size()*sizeof(real);
1362 scalar * list = alist;
1363 for (scalar s in list)
1364 if (!is_constant(s)) {
1365 size_t offset = s.i*size, totalsize = max(s.block, 1)*size;
1366 int index = offset/GPUContext.max_ssbo_size;
1367 offset -= index*GPUContext.max_ssbo_size;
1368 while (totalsize) {
1369 GL_C (glBindBuffer (GL_SHADER_STORAGE_BUFFER, GPUContext.ssbo[index]));
1370 size_t size = min (totalsize, GPUContext.max_ssbo_size - offset);
1371#if SINGLE_PRECISION
1372 float fval = val;
1373 GL_C (glClearBufferSubData (GL_SHADER_STORAGE_BUFFER, GL_R32F,
1374 offset, size,
1375 GL_RED, GL_FLOAT, &fval));
1376#else
1377 GL_C (glClearBufferSubData (GL_SHADER_STORAGE_BUFFER, GL_RG32UI,
1378 offset, size,
1379 GL_RG_INTEGER, GL_UNSIGNED_INT, &val));
1380#endif
1381 totalsize -= size, offset = 0, index++;
1382 }
1383 s.gpu.stored = -1;
1384 }
1385 GL_C (glBindBuffer (GL_SHADER_STORAGE_BUFFER, 0));
1386}
1387
1388#define reset(...) reset_gpu (__VA_ARGS__)
1389
1390void gpu_init_grid (int n)
1391{
1392 if (grid && n == N)
1393 return;
1394 gpu_free_grid();
1395 init_grid (n);
1396 grid = realloc (grid, sizeof (GridGPU));
1397 gpu_init();
1398}
1399
1400// overload the various default grid functions
1401
1402#define init_grid(n) gpu_init_grid(n)
1403#undef free_grid
1404#define free_grid() gpu_free_grid()
1405
1406#include "reduction.h"
1407
1408static External * append_external (External * externals, External ** end, External * g)
1409{
1410 if (externals)
1411 (*end)->next = g;
1412 else
1413 externals = g;
1414 *end = g;
1415 (*end)->next = NULL;
1416 return externals;
1417}
1418
1419static External * merge_external (External * externals, External ** end, External * g,
1420 const ForeachData * loop)
1421{
1422 if (g->type == sym_function_declaration || g->type == sym_function_definition) {
1423 bool boundary = is_boundary_attribute (g);
1424 for (scalar s in baseblock)
1425 if (g->name[0] != '.' || (!boundary && s.gpu.index) ||
1426 (boundary && (s.stencil.io & s_output))) {
1427 void * ptr = g->name[0] != '.' ? g->pointer :
1428 *((void **)(((char *) &_attribute[s.i]) + g->nd));
1429 if (ptr) {
1430 External * p = _get_function ((long) ptr);
1431 if (!p) {
1432 fprintf (stderr, "%s:%d: GLSL: error: unregistered function pointer '%s'\n",
1433 loop->fname, loop->line, g->name);
1434 return NULL;
1435 }
1436 if (!p->used) {
1437 p->used = true;
1438 for (External * i = p->externals; i && i->name; i++)
1439 externals = merge_external (externals, end, i, loop);
1440 externals = append_external (externals, end, p);
1441 }
1442 }
1443 if (g->name[0] != '.')
1444 break;
1445 }
1446 }
1447 for (External * i = externals; i; i = i->next)
1448 if (!strcmp (g->name, i->name)) {
1449
1450 /**
1451 Check whether a local *g* (resp. *i*) shadows a global *i* (resp
1452 *g*). */
1453
1454 if (g->global == 0 && i->global == 1) g->global = 2;
1455 else if (g->global == 1 && i->global == 0) i->global = 2;
1456 else // already in the list
1457 return externals;
1458 }
1459 return append_external (externals, end, g);
1460}
1461
1462static External * merge_externals (External * externals, const ForeachData * loop)
1463{
1464 External * merged = NULL, * end = NULL;
1465 static External ext[] = {
1466 { .name = "X0", .type = sym_DOUBLE, .pointer = &X0, .global = 1 },
1467 { .name = "Y0", .type = sym_DOUBLE, .pointer = &Y0, .global = 1 },
1468 { .name = "Z0", .type = sym_DOUBLE, .pointer = &Z0, .global = 1 },
1469 { .name = "L0", .type = sym_DOUBLE, .pointer = &L0, .global = 1 },
1470 { .name = "N", .type = sym_INT, .pointer = &N, .global = 1 },
1471#if LAYERS
1472 { .name = "nl", .type = sym_INT, .pointer = &nl, .global = 1 },
1473 { .name = "_layer", .type = sym_INT, .pointer = &_layer, .global = 1 },
1474 { .name = ".block", .type = sym_INT, .nd = attroffset (block) },
1475#endif
1476 { .name = NULL }
1477 };
1478 static External bc = {
1479 .name = "apply_bc", .type = sym_function_declaration, .pointer = (void *)(long)apply_bc
1480 };
1481
1482 for (External * g = externals; g->name; g++) {
1483 g->used = false;
1484 if (g->global == 2) g->global = 0;
1485 }
1486 foreach_function (f, f->used = false);
1487 for (External * g = ext; g->name; g++) {
1488 g->used = false;
1489 merged = merge_external (merged, &end, g, loop);
1490 }
1491 if (loop->dirty) {
1492 bc.used = false;
1493 merged = merge_external (merged, &end, &bc, loop);
1494 }
1495 for (External * g = externals; g->name; g++)
1496 merged = merge_external (merged, &end, g, loop);
1497#if PRINTEXTERNAL
1498 for (External * i = merged; i; i = i->next)
1499 fprintf (stderr, "external %s %d %p %d\n", i->name, i->type, i->pointer, i->global);
1500#endif
1501 if (loop->dirty)
1502 for (External * g = merged; g; g = g->next)
1503 if (g->global && !strcmp (g->name, "apply_bc_list"))
1504 g->pointer = loop->dirty;
1505 return merged;
1506}
1507
1508trace
1509static Shader * compile_shader (ForeachData * loop,
1510 uint32_t hash,
1511 const RegionParameters * region,
1512 External * externals,
1513 const char * kernel)
1514{
1515 const char * error = strstr (kernel, "@error ");
1516 if (error) {
1517 for (const char * s = error + 7; *s != '\n' && *s != '\0'; s++)
1518 fputc (*s, stderr);
1519 fputc ('\n', stderr);
1520 loop->data = NULL;
1521 return NULL;
1522 }
1523
1524 External * merged = merge_externals (externals, loop);
1525 if (!merged) {
1526 loop->data = NULL;
1527 return NULL;
1528 }
1529
1530 int local = false;
1531 for (const External * g = merged; g; g = g->next) {
1532 if (g->global == 2)
1533 local = true;
1534 if (g->type != sym_SCALAR && g->type != sym_VECTOR && g->type != sym_TENSOR) {
1535 if (g->reduct && !strchr ("+mM", g->reduct)) {
1536 if (loop->first)
1537 fprintf (stderr,
1538 "%s:%d: GLSL: error: unknown reduction operation '%c'\n",
1539 loop->fname, loop->line, g->reduct);
1540 return NULL;
1541 }
1542 if (g->type == sym_COORD || g->type == sym__COORD || g->type == sym_VEC4) {
1543 if (g->reduct) {
1544 if (loop->first)
1545 fprintf (stderr,
1546 "%s:%d: GLSL: error: reductions not implemented for '%s' type\n",
1547 loop->fname, loop->line, type_string (g));
1548 return NULL;
1549 }
1550 }
1551 else if (g->type != sym_FLOAT &&
1552 g->type != sym_DOUBLE &&
1553 g->type != sym_INT &&
1554 g->type != sym_LONG &&
1555 g->type != sym_BOOL &&
1556 g->type != sym_enumeration_constant &&
1557 g->type != sym_IVEC &&
1558 g->type != sym_function_declaration &&
1559 g->type != sym_function_definition) {
1560 if (loop->first)
1561 fprintf (stderr, "%s:%d: GLSL: error: unknown type %d for '%s'\n",
1562 loop->fname, loop->line, g->type, g->name);
1563 return NULL;
1564 }
1565 }
1566 }
1567
1568 /**
1569 ## Number of compute shader work groups and work group sizes */
1570
1571 static const int NWG[2] = {16, 16};
1572 GLuint ng[2], nwg[2];
1573 int Nl = region->level > 0 ? 1 << (region->level - 1) : N/Dimensions.x;
1574 int * dims = &Dimensions.x;
1575 if (loop->face || loop->vertex)
1576 for (int i = 0; i < 2; i++) {
1577 if (Nl*dims[1-i] > NWG[i]) {
1578 nwg[i] = NWG[i] + 1;
1579 ng[i] = Nl*dims[1-i]/NWG[i];
1580 }
1581 else {
1582 nwg[i] = Nl*dims[1-i] + 1;
1583 ng[i] = 1;
1584 }
1585 assert (nwg[i]*ng[i] >= Nl*dims[1-i] + 1);
1586 }
1587 else
1588 for (int i = 0; i < 2; i++) {
1589 nwg[i] = Nl < NWG[i] ? Nl : NWG[i];
1590 ng[i] = Nl*dims[1-i]/nwg[i];
1591 assert (nwg[i]*ng[i] == Nl*dims[1-i]);
1592 }
1593
1594 char * shader = build_shader (merged, loop, region, nwg);
1595 if (!shader)
1596 return NULL;
1597
1598 /**
1599 ## main() */
1600
1601 if (local) {
1602 shader = str_append (shader, "void _loop (");
1603 for (const External * g = merged; g; g = g->next)
1604 if (g->global == 2) {
1605 shader = str_append (shader, local++ == 1 ? "" : ", ", type_string (g), " ", g->name);
1606 if (g->nd) {
1607 int size = list_size (g);
1608 if (size > 0) {
1609 char s[20]; snprintf (s, 19, "%d", size);
1610 shader = str_append (shader, "[", s, "]");
1611 }
1612 }
1613 }
1614 shader = str_append (shader, ") {\n");
1615 }
1616 else
1617 shader = str_append (shader, "void main() {\n");
1618
1619 if (!GPUContext.fragment_shader) {
1620 char d[20];
1621 snprintf (d, 19, "%d", region->level > 0 ? region->level - 1 : depth());
1622 shader = str_append (shader, "Point point = {csOrigin.x + int(gl_GlobalInvocationID.y) + GHOSTS,"
1623 "csOrigin.y + int(gl_GlobalInvocationID.x) + GHOSTS,", d,
1624#if MULTIGRID
1625 ",ivec2((1<<",d,")*Dimensions.x,(1<<",d,")*Dimensions.y)"
1626#else
1627 ",N"
1628#endif
1629#if LAYERS
1630 ",0};\n"
1631#else
1632 "};\n"
1633#endif
1634 );
1635 }
1636 shader = str_append (shader,
1637 "if (point.i < N*Dimensions.x + 2*GHOSTS && "
1638 "point.j < N*Dimensions.y + 2*GHOSTS) {\n");
1639 if (loop->vertex)
1640 shader = str_append (shader, " int ig = -1, jg = -1;\n");
1641 shader = str_append (shader, kernel);
1642 shader = str_append (shader, "\nif (point.j - GHOSTS < NY) {");
1643 for (const External * g = merged; g; g = g->next)
1644 if (g->reduct) {
1645 shader = str_append (shader, "\n val_red_(", g->name, "_out_) = ", g->name, ";");
1646 scalar s = g->s;
1647 s.gpu.stored = -1;
1648 }
1649 shader = str_append (shader, "\n}",
1650 loop->dirty ? "apply_bc(point);" : "",
1651 "}}\n");
1652
1653 if (local) {
1654 shader = str_append (shader, "void main(){_loop(");
1655 local = 1;
1656 for (const External * g = merged; g; g = g->next)
1657 if (g->global == 2)
1658 shader = str_append (shader, local++ == 1 ? "" : ",", EXTERNAL_NAME (g));
1659 shader = str_append (shader, ");}\n");
1660 }
1661
1662 Shader * s = load_shader (shader, hash, loop);
1663 loop->data = s;
1664 if (!s)
1665 return NULL;
1666 s->ng[0] = ng[0], s->ng[1] = ng[1];
1667
1668 /**
1669 ## Make list of uniforms */
1670
1671 for (External * g = merged; g; g = g->next)
1672 g->used = 0;
1673 int index = 1;
1674 for (External * g = externals; g && g->name; g++)
1675 g->used = index++;
1676 int nuniforms = 0;
1677 for (const External * g = merged; g; g = g->next) {
1678 if (g->name[0] == '.') continue;
1679 if (IS_EXTERNAL_CONSTANT(g)) continue;
1680 if (g->type == sym_function_declaration || g->type == sym_function_definition) continue;
1681 if (g->type == sym_INT && (!strcmp (g->name, "N") ||
1682 !strcmp (g->name, "nl") ||
1683 !strcmp (g->name, "bc_period_x") ||
1684 !strcmp (g->name, "bc_period_y")))
1685 continue;
1686 if (g->type == sym_INT ||
1687 g->type == sym_LONG ||
1688 g->type == sym_FLOAT ||
1689 g->type == sym_DOUBLE ||
1690 g->type == sym__COORD ||
1691 g->type == sym_COORD ||
1692 g->type == sym_BOOL ||
1693 g->type == sym_VEC4) {
1694 char * name = str_append (NULL, EXTERNAL_NAME (g));
1695 int location = glGetUniformLocation (s->id, name);
1696 sysfree (name);
1697 if (location >= 0) {
1698 // fprintf (stderr, "%s:%d: %s\n", loop->fname, loop->line, name);
1699 // not an array or just a one-dimensional array
1700 assert (!g->nd);
1701 assert (!g->data || ((int *)g->data)[1] == 0);
1702 int nd = g->data ? ((int *)g->data)[0] : 1;
1703 s->uniforms = realloc (s->uniforms, (nuniforms + 2)*sizeof(MyUniform));
1704 s->uniforms[nuniforms] = (MyUniform){
1705 .location = location, .type = g->type, .nd = nd,
1706 .local = g->global == 1 ? -1 : g->used - 1,
1707 .pointer = g->global == 1 ? g->pointer : NULL };
1708 s->uniforms[nuniforms + 1].type = 0;
1709 nuniforms++;
1710 // uniforms referring to local variables must be in the 'externals' local list
1711 assert (g->global == 1 || g->used);
1712 }
1713 }
1714 }
1715
1716 return s;
1717}
1718
1719static
1720void free_reduction_fields (const External * externals)
1721{
1722 for (const External * g = externals; g; g = g->next)
1723 if (g->reduct) {
1724 scalar s = g->s;
1725 delete ({s});
1726 }
1727}
1728
1729trace
1730static Shader * setup_shader (ForeachData * loop, const RegionParameters * region,
1731 External * externals,
1732 const char * kernel)
1733{
1734 /**
1735 We will directly apply boundary conditions to fields marked 'dirty'
1736 by automatic stencils. */
1737
1738 apply_bc_list = loop->dirty;
1739 for (int _s = 0; _s < 1; _s++) /* scalar in loop->dirty */ {
1740
1741 /**
1742 We also apply the boundary stencils so that the input/output flags are
1743 also set properly for boundary conditions which may use external fields. */
1744
1745 for (int b = 0; b < nboundary; b++)
1746 if (s.boundary_stencil[b])
1747 s.boundary_stencil[b] ((Point){0}, (Point){0}, s, NULL);
1748
1749 /**
1750 We make sure all fields marked dirty are also outputs. */
1751
1752 if (!(s.stencil.io & s_output))
1753 s.stencil.io |= s_output;
1754 }
1755
1756 for (int _s = 0; _s < 1; _s++) /* scalar in baseblock */
1757 s.gpu.index = 0;
1758 int index = 1;
1759 for (int _s = 0; _s < 1; _s++) /* scalar in baseblock */
1760 if (((s.stencil.io & s_input) || (s.stencil.io & s_output)) && !s.gpu.index) {
1761#if PRINTBOUNDARY
1762 fprintf (stderr, "%s:%d: %s:%s%s index: %d\n",
1763 loop->fname, loop->line,
1764 s.name, (s.stencil.io & s_input) ? " input" : "",
1765 (s.stencil.io & s_output) ? " output" : "", index);
1766#endif
1767 if (s.v.x.i == -1) // scalar
1768 s.gpu.index = index++;
1769 else { // vector
1770 vector v = s.v;
1771 for (int _c = 0; _c < 1; _c++) /* scalar in {v} */
1772 if (!c.gpu.index)
1773 c.gpu.index = index++;
1774 }
1775 }
1776 for (int _s = 0; _s < 1; _s++) /* scalar in loop->dirty */ {
1777#if PRINTBOUNDARY
1778 fprintf (stderr, "%s:%d: dirty: %s:%s%s index: %d\n",
1779 loop->fname, loop->line, s.name,
1780 (s.stencil.io & s_input) ? " input" : "",
1781 (s.stencil.io & s_output) ? " output" : "",
1782 s.gpu.index);
1783#endif
1784 s.boundary_left = s.boundary[left];
1785 s.boundary_right = s.boundary[right];
1786 s.boundary_top = s.boundary[top];
1787 s.boundary_bottom = s.boundary[bottom];
1788 }
1789
1790 /**
1791 ## Allocate reduction fields */
1792
1793 for (External * g = externals; g && g->name; g++)
1794 if (g->reduct) {
1795 scalar s = {0} /* new scalar */;
1796 s.stencil.io |= s_output;
1797 g->s = s;
1798#if PRINTREDUCT
1799 fprintf (stderr, "%s:%d: new reduction field %d for %s\n",
1800 loop->fname, loop->line, s.i, g->name);
1801#endif
1802 }
1803
1804 /**
1805 ## Reuse or compile a new shader */
1806
1807 Shader * shader;
1808 uint32_t hash = hash_shader (externals, loop, region, kernel);
1809 assert (gpu_grid->shaders);
1810 khiter_t k = kh_get (INT, gpu_grid->shaders, hash);
1811 if (k != kh_end (gpu_grid->shaders))
1812 shader = kh_value (gpu_grid->shaders, k);
1813 else {
1814 shader = compile_shader (loop, hash, region, externals, kernel);
1815 if (!shader) {
1816 free_reduction_fields (externals);
1817 return NULL;
1818 }
1819 }
1820
1821 gpu_cpu_sync (baseblock, GL_MAP_WRITE_BIT, loop->fname, loop->line);
1822
1823 /**
1824 ## Apply boundary conditions
1825
1826 This can be required if boundary conditions have been modified
1827 between loops. */
1828
1829 scalar * listc = NULL;
1830 for (int _s = 0; _s < 1; _s++) /* scalar in loop->listc */
1831 if (!(s.stencil.bc & s_centered))
1832 listc = list_prepend (listc, s);
1833 scalar * listf_x = NULL, * listf_y = NULL;
1834 for (int _d = 0; _d < dimension; _d++)
1835 for (int _s = 0; _s < 1; _s++) /* scalar in loop->listf.x */
1836 if (!(s.stencil.bc & s_face))
1837 listf_x = list_prepend (listf_x, s);
1838 if (listc || listf_x || listf_y) {
1839#if PRINTBC
1840 fprintf (stderr, "%s:%d: applying BCs for", loop->fname, loop->line);
1841 for (int _s = 0; _s < 1; _s++) /* scalar in listc */
1842 fprintf (stderr, " %s", s.name);
1843 for (int _d = 0; _d < dimension; _d++)
1844 for (int _s = 0; _s < 1; _s++) /* scalar in listf_x */
1845 fprintf (stderr, " %s", s.name);
1846 fputc ('\n', stderr);
1847#endif
1848 int nvar = datasize/sizeof(real);
1849 _Attributes backup[nvar];
1850 memcpy (backup, _attribute, nvar*sizeof(_Attributes));
1851 foreach (gpu) {
1852 for (int _s = 0; _s < 1; _s++) /* scalar in listc */
1853 s[] = s[]; // does nothing but ensures that BCs are applied at the end of the loop
1854 for (int _d = 0; _d < dimension; _d++)
1855 for (int _s = 0; _s < 1; _s++) /* scalar in listf_x */
1856 s[] = s[];
1857 }
1858 memcpy (_attribute, backup, nvar*sizeof(_Attributes));
1859 for (int _s = 0; _s < 1; _s++) /* scalar in listc */ {
1860 s.stencil.bc |= s_centered;
1861 s.gpu.stored = -1;
1862 }
1863 free (listc);
1864 for (int _d = 0; _d < dimension; _d++) {
1865 for (int _s = 0; _s < 1; _s++) /* scalar in listf_x */ {
1866 s.stencil.bc |= s_face;
1867 s.gpu.stored = -1;
1868 }
1869 free (listf_x);
1870 }
1871 apply_bc_list = loop->dirty;
1872 }
1873
1874 /**
1875 For the Intel driver, it looks like the next line is necessary to
1876 ensure proper synchronisation of the compute shader and fragment
1877 shader (for example when using output_ppm() for interactive
1878 display). The nvidia driver somehow does not need this... */
1879
1880 if (shader->id != GPUContext.current_shader) {
1881 GL_C (glBindBufferBase (GL_SHADER_STORAGE_BUFFER, 0, 0));
1882 GL_C (glUseProgram (shader->id));
1883 for (int i = 0; i < GPUContext.nssbo; i++)
1884 GL_C (glBindBufferBase (GL_SHADER_STORAGE_BUFFER, i, GPUContext.ssbo[i]));
1885 GPUContext.current_shader = shader->id;
1886 }
1887
1888 /**
1889 ## Set uniforms */
1890
1891 for (const MyUniform * g = shader->uniforms; g && g->type; g++) {
1892 void * pointer = g->pointer;
1893 if (!pointer) {
1894 assert (g->local >= 0);
1895 pointer = externals[g->local].pointer;
1896 }
1897 switch (g->type) {
1898 case sym_INT:
1899 glUniform1iv (g->location, g->nd, pointer); break;
1900 case sym_FLOAT:
1901 glUniform1fv (g->location, g->nd, pointer); break;
1902 case sym_VEC4:
1903 glUniform4fv (g->location, g->nd, pointer); break;
1904 case sym_BOOL: {
1905 int p[g->nd];
1906 bool * data = pointer;
1907 for (int i = 0; i < g->nd; i++)
1908 p[i] = data[i];
1909 glUniform1iv (g->location, g->nd, p);
1910 break;
1911 }
1912 case sym_LONG: {
1913 int p[g->nd];
1914 long * data = pointer;
1915 for (int i = 0; i < g->nd; i++)
1916 p[i] = data[i];
1917 glUniform1iv (g->location, g->nd, p);
1918 break;
1919 }
1920#if SINGLE_PRECISION
1921 case sym_DOUBLE: {
1922 float p[g->nd];
1923 double * data = pointer;
1924 for (int i = 0; i < g->nd; i++)
1925 p[i] = data[i];
1926 glUniform1fv (g->location, g->nd, p);
1927 break;
1928 }
1929 case sym__COORD: {
1930 float p[2*g->nd];
1931 double * data = pointer;
1932 for (int i = 0; i < 2*g->nd; i++)
1933 p[i] = data[i];
1934 glUniform2fv (g->location, g->nd, p);
1935 break;
1936 }
1937 case sym_COORD: {
1938 float p[3*g->nd];
1939 double * data = pointer;
1940 for (int i = 0; i < 3*g->nd; i++)
1941 p[i] = data[i];
1942 glUniform3fv (g->location, g->nd, p);
1943 break;
1944 }
1945#else // DOUBLE_PRECISION
1946 case sym_DOUBLE:
1947 glUniform1dv (g->location, g->nd, pointer); break;
1948 case sym__COORD:
1949 glUniform2dv (g->location, g->nd, pointer); break;
1950 case sym_COORD:
1951 glUniform3dv (g->location, g->nd, pointer); break;
1952#endif // DOUBLE_PRECISION
1953 default:
1954 assert (false);
1955 }
1956 }
1957
1958 return shader;
1959}
1960
1961static bool doloop_on_gpu (ForeachData * loop, const RegionParameters * region,
1962 External * externals,
1963 const char * kernel)
1964{
1965 Shader * shader = setup_shader (loop, region, externals, kernel);
1966 if (!shader)
1967 return false;
1968
1969 /**
1970 ## Render
1971
1972 If this is a `foreach_point()` iteration, we dispatch the computation for a single point */
1973
1974 int Nl = region->level > 0 ? 1 << (region->level - 1) : N/Dimensions.x;
1975 if (region->n.x == 1 && region->n.y == 1) {
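 // convert the point's physical coordinates into global cell indices,
 // using the (isotropic) cell size Delta = L0/(Nl*Dimensions.x)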
1976 int csOrigin[] = {
1977 (region->p.x - X0)/L0*Nl*Dimensions.x,
1978 (region->p.y - Y0)/L0*Nl*Dimensions.x
1979 };
1980 GL_C (glUniform2iv (0, 1, csOrigin));
1981 assert (!GPUContext.fragment_shader);
1982 GL_C (glMemoryBarrier (GL_SHADER_STORAGE_BARRIER_BIT));
1983 GL_C (glDispatchCompute (1, 1, 1));
1984 }
1985
1986 /**
1987 This is a region */
1988
1989 else if (region->n.x || region->n.y) {
1990 float vsScale[] = {
1991 (region->box[1].x - region->box[0].x)/L0,
1992 (region->box[1].y - region->box[0].y)/L0
1993 };
1994 float vsOrigin[] = { (region->box[0].x - X0)/L0, (region->box[0].y - Y0)/L0 };
1995 GL_C (glUniform2fv (1, 1, vsOrigin));
1996 GL_C (glUniform2fv (2, 1, vsScale));
1997 assert (GPUContext.fragment_shader);
1998 GL_C (glMemoryBarrier (GL_SHADER_STORAGE_BARRIER_BIT));
1999 GL_C (glDrawArrays (GL_TRIANGLES, 0, 6));
2000 }
2001
2002 else {
2003 assert (!GPUContext.fragment_shader);
2004 GL_C (glMemoryBarrier (GL_SHADER_STORAGE_BARRIER_BIT));
2005 GL_C (glDispatchCompute (shader->ng[0], shader->ng[1], 1));
2006 }
2007
2008 /**
2009 ## Perform reductions and cleanup */
2010
2011 bool nreductions = false;
2012 for (const External * g = externals; g && g->name; g++)
2013 if (g->reduct) {
2014 nreductions = true;
2015 break;
2016 }
2017 if (nreductions)
2018 tracing ("gpu_reduction", loop->fname, loop->line);
2019 for (const External * g = externals; g && g->name; g++)
2020 if (g->reduct) {
2021 scalar s = g->s;
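 // number of elements to reduce: face loops include one extra row or
 // column, vertex loops one extra in each direction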
2022 double result = gpu_reduction (field_offset(s, region->level), g->reduct, region,
2023 loop->face == 1 ? (Nl*Dimensions.x + 1)*Nl*Dimensions.y :
2024 loop->face == 2 ? Nl*Dimensions.x*(Nl*Dimensions.y + 1) :
2025 loop->face == 3 || loop->vertex ?
2026 (Nl*Dimensions.x + 1)*(Nl*Dimensions.y + 1) :
2027 sq(Nl)*Dimensions.x*Dimensions.y);
2028#if PRINTREDUCT
2029 fprintf (stderr, "%s:%d: %s %c %g\n",
2030 loop->fname, loop->line, g->name, g->reduct, result);
2031#endif
2032 if (g->type == sym_DOUBLE) *((double *)g->pointer) = result;
2033 else if (g->type == sym_FLOAT) *((float *)g->pointer) = result;
2034 else if (g->type == sym_INT) *((int *)g->pointer) = result;
2035 else
2036 assert (false);
2037 delete ({s});
2038 }
2039 if (nreductions)
2040 end_tracing ("gpu_reduction", loop->fname, loop->line);
2041
2042 return true;
2043}
2044
2045bool gpu_end_stencil (ForeachData * loop,
2046 const RegionParameters * region,
2047 External * externals,
2048 const char * kernel)
2049{
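 // Run the loop on the GPU when it may be offloaded (parallel == 1, unless
 // 'on_cpu' is set) or must be (parallel == 3), and when this is either the
 // first pass or a shader has already been compiled for this loop (loop->data).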
2050 bool on_gpu = ((loop->parallel == 1 && !on_cpu) || loop->parallel == 3) &&
2051 (loop->first || loop->data);
2052 if (on_gpu) {
2053 on_gpu = doloop_on_gpu (loop, region, externals, kernel);
2054 if (!on_gpu) {
2055 fprintf (stderr, "%s:%d: %s: foreach() done on CPU (see GLSL errors above)\n",
2056 loop->fname, loop->line, loop->parallel == 3 ? "error" : "warning");
2057 if (loop->parallel == 3) // must run on GPU but cannot run
2058 exit (1);
2059 loop->data = NULL;
2060 }
2061 }
2062 if (on_gpu) {
2063 // do not apply BCs on CPU
2064 free (loop->listc), loop->listc = NULL;
2065 for (int _d = 0; _d < dimension; _d++)
2066 free (loop->listf.x), loop->listf.x = NULL;
2067 for (int _s = 0; _s < 1; _s++) /* scalar in loop->dirty */
2068 s.stencil.bc = s_centered|s_face;
2069 free (loop->dirty), loop->dirty = NULL;
2070 }
2071 else {
2072 gpu_cpu_sync (baseblock, GL_MAP_READ_BIT, loop->fname, loop->line);
2073 boundary_stencil (loop);
2074 }
2075
2076 for (int _s = 0; _s < 1; _s++) /* scalar in baseblock */
2077 if (s.stencil.io & s_output)
2078 s.gpu.stored = on_gpu ? -1 : 1;
2079
2080 return on_gpu && loop->parallel != 3;
2081}
2082
2083/**
2084## Useful (and not so useful) links
2085
2086* [Core Language (GLSL)](https://www.khronos.org/opengl/wiki/Core_Language_(GLSL))
2087* [Image Load Store](https://www.khronos.org/opengl/wiki/Image_Load_Store)
2088* [Persistent Mapped Buffers in OpenGL](https://www.cppstories.com/2015/01/persistent-mapped-buffers-in-opengl/)
2089* [Best Practices for Working with Texture Data (OpenGL Programming Guide for Mac)](https://developer.apple.com/library/archive/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_texturedata/opengl_texturedata.html)
2090* [https://github.com/Erkaman/fluid_sim]()
2091* [https://stackoverflow.com/questions/67529148/how-can-i-optimize-a-compute-shader-thats-heavy-on-imageload-calls]()
2092* [https://www.slideshare.net/Mark_Kilgard/siggraph-2012-nvidia-opengl-for-2012]()
2093* [https://stackoverflow.com/questions/7954927/passing-a-list-of-values-to-fragment-shader]()
2094* [https://computergraphics.stackexchange.com/questions/5416/why-use-image-load-stores-instead-of-fbos]()
2095* [https://computergraphics.stackexchange.com/questions/9956/performance-of-compute-shaders-vs-fragment-shaders-for-deferred-rendering]()
2096* [https://computergraphics.stackexchange.com/questions/54/when-is-a-compute-shader-more-efficient-than-a-pixel-shader-for-image-filtering]()
2097* [http://diaryofagraphicsprogrammer.blogspot.com/2014/03/compute-shader-optimizations-for-amd.html]()
2098* [DirectCompute: Optimizations and Best Practices, Eric Young, NVIDIA Corporation, 2010](http://on-demand.gputechconf.com/gtc/2010/presentations/S12312-DirectCompute-Pre-Conference-Tutorial.pdf)
2099* [Compute shaders in graphics: Gaussian blur](https://lisyarus.github.io/blog/graphics/2022/04/21/compute-blur.html)
2100* [Arm Mali GPUs Best Practices Developer Guide](https://armkeil.blob.core.windows.net/developer/Arm%20Developer%20Community/PDF/Arm%20Mali%20GPU%20Best%20Practices.pdf)
2101* [Rendering to a 3D texture](https://community.khronos.org/t/rendering-to-a-3d-texture/75285/2)
2102* [GLFW Shared Contexts](https://www.glfw.org/docs/3.3/context_guide.html#context_sharing)
2103* [Slides on using Array or Bindless Textures](https://www.slideshare.net/CassEveritt/beyond-porting)
2104* [Optimizing Compute Shaders for L2 Locality using Thread-Group ID Swizzling](https://developer.nvidia.com/blog/optimizing-compute-shaders-for-l2-locality-using-thread-group-id-swizzling/)
2105* [Optimizing GPU occupancy and resource usage with large thread groups](https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/)
2106* [https://www.reddit.com/r/CUDA/comments/lkhcbv/is_there_a_list_of_gpus_ranked_by_fp64/]()
2107* [https://www.techpowerup.com/gpu-specs/]()
2108*/