<div dir="ltr"><div>Hello Haskell Cafe,</div><div><br></div><div>I have written a small, fairly simple program, but I am finding it hard to reason about its behavior (and about the best way to do what I want), so I would like to ask you all for some suggestions.</div><div><br></div><div>For reference, here's a <a href="https://stackoverflow.com/questions/48330690/haskell-conduit-aeson-parsing-large-jsons-and-filter-matching-key-values/48348153#48348153">Stack Overflow question</a> where I described what's going on, but I'll also describe it below.</div><div><br></div><div>My program does the following:<br></div><div><ol><li>Recursively list a directory,</li><li>Parse the JSON files from the directory list into identifiable objects/records,<br></li><li>Look for matching key-value pairs, and</li><li>Return filenames where matches have been found.</li></ol><div>A few details for more context:</div><ul><li>I have to filter between 500,000 and 1 million files (I'm typically trying to narrow them down to between 1,000 and 40,000 that represent a particular project). I usually just need the filenames.<br></li><li>Each file is quite large, some of them 5 MB or 10 MB, and it's not uncommon for them to have deeply nested keys (40,000 keys or so).</li></ul><div>My first version of this program was simple, synchronous, and as straightforward as I could make it. However, the memory usage increased monotonically. Profiling showed that most of the time was spent parsing the JSON into Objects before my code could turn the objects into records (and, as you might imagine, a great deal of time in garbage collection).<br></div><div><br></div><div>For my second version, I switched to conduit, which seemed to solve the growing-memory issue. 
My core function now looked like this:</div><div><pre class="gmail-lang-hs gmail-prettyprint gmail-prettyprinted"><code><span class="gmail-pln">conduitFilesFilter </span><span class="gmail-pun">::</span><span class="gmail-pln"> ProjectFilter </span><span class="gmail-pun">-></span><span class="gmail-pln"> Path Abs Dir </span><span class="gmail-pun">-></span><span class="gmail-pln"> IO </span><span class="gmail-pun">[</span><span class="gmail-pln">Path Abs File</span><span class="gmail-pun">]</span><span class="gmail-pln">
conduitFilesFilter projFilter dirname' </span><span class="gmail-pun">=</span><span class="gmail-pln"> </span><span class="gmail-kwd">do</span><span class="gmail-pln">
  </span><span class="gmail-pun">(_,</span><span class="gmail-pln"> allFiles</span><span class="gmail-pun">)</span><span class="gmail-pln"> </span><span class="gmail-pun"><-</span><span class="gmail-pln"> listDirRecur dirname'
  C.runConduit </span><span class="gmail-pun">$</span><span class="gmail-pln">
    C.yieldMany allFiles
    </span><span class="gmail-pun">.|</span><span class="gmail-pln"> C.filterMC </span><span class="gmail-pun">(</span><span class="gmail-pln">filterMatchingFile projFilter</span><span class="gmail-pun">)</span><span class="gmail-pln">
    </span><span class="gmail-pun">.|</span><span class="gmail-pln"> C.sinkList</span></code></pre></div><div><br></div><div>This was still slow and certainly still synchronous. What I really wanted was to run that "filterMatchingFile..." part in parallel across a number of CPUs. As an aside, my filtering function looks like this:</div><div><br></div><div><span style="font-family:monospace,monospace">filterMatchingFile :: ProjectFilter -> Path Abs File -> IO Bool<br>filterMatchingFile (ProjectFilter filterFunc) fpath = do<br>  let fp = toFilePath fpath<br>  bs <- B.readFile fp<br>  case validImplProject bs of  -- this is pretty much just `decodeStrict`<br>    Nothing -> pure False<br>    (Just proj') -> pure $ filterFunc proj'</span><br></div><div><br></div><div>Here are the stats from running this:</div><div><br></div><div><pre class="gmail-lang-hs gmail-prettyprint gmail-prettyprinted"><code><span class="gmail-lit">115</span><span class="gmail-pun">,</span><span class="gmail-lit">961</span><span class="gmail-pun">,</span><span class="gmail-lit">554</span><span class="gmail-pun">,</span><span class="gmail-lit">600</span><span class="gmail-pln"> bytes allocated </span><span class="gmail-kwd">in</span><span class="gmail-pln"> the heap
  </span><span class="gmail-lit">35</span><span class="gmail-pun">,</span><span class="gmail-lit">870</span><span class="gmail-pun">,</span><span class="gmail-lit">639</span><span class="gmail-pun">,</span><span class="gmail-lit">768</span><span class="gmail-pln"> bytes copied during GC
      </span><span class="gmail-lit">56</span><span class="gmail-pun">,</span><span class="gmail-lit">467</span><span class="gmail-pun">,</span><span class="gmail-lit">720</span><span class="gmail-pln"> bytes maximum residency </span><span class="gmail-pun">(</span><span class="gmail-lit">681</span><span class="gmail-pln"> sample</span><span class="gmail-pun">(</span><span class="gmail-pln">s</span><span class="gmail-pun">))</span><span class="gmail-pln">
       </span><span class="gmail-lit">1</span><span class="gmail-pun">,</span><span class="gmail-lit">283</span><span class="gmail-pun">,</span><span class="gmail-lit">008</span><span class="gmail-pln"> bytes maximum slop
             </span><span class="gmail-lit">145</span><span class="gmail-pln"> MB total memory </span><span class="gmail-kwd">in</span><span class="gmail-pln"> use </span><span class="gmail-pun">(</span><span class="gmail-lit">0</span><span class="gmail-pln"> MB lost due to fragmentation</span><span class="gmail-pun">)</span><span class="gmail-pln">

                                     Tot time </span><span class="gmail-pun">(</span><span class="gmail-pln">elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">  Avg pause  Max pause
  Gen  </span><span class="gmail-lit">0</span><span class="gmail-pln">     </span><span class="gmail-lit">108716</span><span class="gmail-pln"> colls</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">108716</span><span class="gmail-pln"> par   </span><span class="gmail-lit">76.915</span><span class="gmail-pln">s  </span><span class="gmail-lit">20.571</span><span class="gmail-pln">s     </span><span class="gmail-lit">0.0002</span><span class="gmail-pln">s    </span><span class="gmail-lit">0.0266</span><span class="gmail-pln">s
  Gen  </span><span class="gmail-lit">1</span><span class="gmail-pln">       </span><span class="gmail-lit">681</span><span class="gmail-pln"> colls</span><span class="gmail-pun">,</span><span class="gmail-pln">   </span><span class="gmail-lit">680</span><span class="gmail-pln"> par    </span><span class="gmail-lit">0.530</span><span class="gmail-pln">s   </span><span class="gmail-lit">0.147</span><span class="gmail-pln">s     </span><span class="gmail-lit">0.0002</span><span class="gmail-pln">s    </span><span class="gmail-lit">0.0009</span><span class="gmail-pln">s

  Parallel GC work balance</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">14.99</span><span class="gmail-pun">%</span><span class="gmail-pln"> </span><span class="gmail-pun">(</span><span class="gmail-pln">serial </span><span class="gmail-lit">0</span><span class="gmail-pun">%,</span><span class="gmail-pln"> perfect </span><span class="gmail-lit">100</span><span class="gmail-pun">%)</span><span class="gmail-pln">

  TASKS</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">10</span><span class="gmail-pln"> </span><span class="gmail-pun">(</span><span class="gmail-lit">1</span><span class="gmail-pln"> bound</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">9</span><span class="gmail-pln"> peak workers </span><span class="gmail-pun">(</span><span class="gmail-lit">9</span><span class="gmail-pln"> total</span><span class="gmail-pun">),</span><span class="gmail-pln"> using </span><span class="gmail-pun">-</span><span class="gmail-pln">N4</span><span class="gmail-pun">)</span><span class="gmail-pln">

  SPARKS</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> </span><span class="gmail-pun">(</span><span class="gmail-lit">0</span><span class="gmail-pln"> converted</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> overflowed</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> dud</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> GC'd</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> fizzled</span><span class="gmail-pun">)</span><span class="gmail-pln">

  INIT    time    </span><span class="gmail-lit">0.001</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln">  </span><span class="gmail-lit">0.007</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  MUT     time   </span><span class="gmail-lit">34.813</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln"> </span><span class="gmail-lit">42.938</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  GC      time   </span><span class="gmail-lit">77.445</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln"> </span><span class="gmail-lit">20.718</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  EXIT    time    </span><span class="gmail-lit">0.000</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln">  </span><span class="gmail-lit">0.010</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  Total   time  </span><span class="gmail-lit">112.260</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln"> </span><span class="gmail-lit">63.672</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">

  Alloc rate    </span><span class="gmail-lit">3</span><span class="gmail-pun">,</span><span class="gmail-lit">330</span><span class="gmail-pun">,</span><span class="gmail-lit">960</span><span class="gmail-pun">,</span><span class="gmail-lit">996</span><span class="gmail-pln"> bytes per MUT second

  Productivity  </span><span class="gmail-lit">31.0</span><span class="gmail-pun">%</span><span class="gmail-pln"> </span><span class="gmail-kwd">of</span><span class="gmail-pln"> total user</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">67.5</span><span class="gmail-pun">%</span><span class="gmail-pln"> </span><span class="gmail-kwd">of</span><span class="gmail-pln"> total elapsed

gc_alloc_block_sync</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">188614</span><span class="gmail-pln">
whitehole_spin</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln">
gen</span><span class="gmail-pun">[</span><span class="gmail-lit">0</span><span class="gmail-pun">].</span><span class="gmail-pln">sync</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">33</span><span class="gmail-pln">
gen</span><span class="gmail-pun">[</span><span class="gmail-lit">1</span><span class="gmail-pun">].</span><span class="gmail-pln">sync</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">811204</span></code></pre></div><div><br></div><div>I thought about writing a plainer (non-conduit) parallel version, but I was afraid of running into the memory issue again. I tried to write a conduit-plus-channels version, but it didn't work.<br></div><div><br></div><div>Finally, I wrote a version using stm-conduit, which I thought might be a bit more efficient. It seems to be slightly better, but it's not really the kind of parallelization I was imagining:</div><div><br></div><div><span style="font-family:monospace,monospace">conduitAsyncFilterFiles :: ProjectFilter -> Path Abs Dir -> IO [String]<br>conduitAsyncFilterFiles projFilter dirname' = do<br>  (_, allFiles) <- listDirRecur dirname'<br>  buffer 10<br>    (C.yieldMany allFiles<br>    .| (C.mapMC (readFileWithPath . toFilePath)))<br>    (C.mapC (filterProjForFilename projFilter)<br>         .| C.filterC isJust<br>         .| C.mapC fromJust<br>         .| C.sinkList)</span><br></div><div><br></div><div>The first conduit passed to `buffer` does something like the following: <span style="font-family:monospace,monospace">parseStrict . B.readFile</span>.</div><div><br></div><div>This still wasn't great, but after reading about handling garbage collection in smarter ways, I found that I could run my application like this:</div><div><pre class="gmail-lang-hs gmail-prettyprint gmail-prettyprinted"><code><span class="gmail-pln">stack exec search</span><span class="gmail-pun">-</span><span class="gmail-pln">json </span><span class="gmail-com">-- --searchPath $FILES --name hello +RTS -s -A32m -n4m</span></code></pre></div><div>With these flags, the reported "Productivity" shoots up considerably, presumably because garbage collection runs far less often. 
My program also got a bit faster:</div><div><br></div><div><pre class="gmail-lang-hs gmail-prettyprint gmail-prettyprinted"><code><span class="gmail-pln"> </span><span class="gmail-lit">36</span><span class="gmail-pun">,</span><span class="gmail-lit">379</span><span class="gmail-pun">,</span><span class="gmail-lit">265</span><span class="gmail-pun">,</span><span class="gmail-lit">096</span><span class="gmail-pln"> bytes allocated </span><span class="gmail-kwd">in</span><span class="gmail-pln"> the heap
   </span><span class="gmail-lit">1</span><span class="gmail-pun">,</span><span class="gmail-lit">238</span><span class="gmail-pun">,</span><span class="gmail-lit">438</span><span class="gmail-pun">,</span><span class="gmail-lit">160</span><span class="gmail-pln"> bytes copied during GC
      </span><span class="gmail-lit">22</span><span class="gmail-pun">,</span><span class="gmail-lit">996</span><span class="gmail-pun">,</span><span class="gmail-lit">264</span><span class="gmail-pln"> bytes maximum residency </span><span class="gmail-pun">(</span><span class="gmail-lit">85</span><span class="gmail-pln"> sample</span><span class="gmail-pun">(</span><span class="gmail-pln">s</span><span class="gmail-pun">))</span><span class="gmail-pln">
       </span><span class="gmail-lit">3</span><span class="gmail-pun">,</span><span class="gmail-lit">834</span><span class="gmail-pun">,</span><span class="gmail-lit">152</span><span class="gmail-pln"> bytes maximum slop
             </span><span class="gmail-lit">207</span><span class="gmail-pln"> MB total memory </span><span class="gmail-kwd">in</span><span class="gmail-pln"> use </span><span class="gmail-pun">(</span><span class="gmail-lit">14</span><span class="gmail-pln"> MB lost due to fragmentation</span><span class="gmail-pun">)</span><span class="gmail-pln">

                                     Tot time </span><span class="gmail-pun">(</span><span class="gmail-pln">elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">  Avg pause  Max pause
  Gen  </span><span class="gmail-lit">0</span><span class="gmail-pln">       </span><span class="gmail-lit">211</span><span class="gmail-pln"> colls</span><span class="gmail-pun">,</span><span class="gmail-pln">   </span><span class="gmail-lit">211</span><span class="gmail-pln"> par    </span><span class="gmail-lit">1.433</span><span class="gmail-pln">s   </span><span class="gmail-lit">0.393</span><span class="gmail-pln">s     </span><span class="gmail-lit">0.0019</span><span class="gmail-pln">s    </span><span class="gmail-lit">0.0077</span><span class="gmail-pln">s
  Gen  </span><span class="gmail-lit">1</span><span class="gmail-pln">        </span><span class="gmail-lit">85</span><span class="gmail-pln"> colls</span><span class="gmail-pun">,</span><span class="gmail-pln">    </span><span class="gmail-lit">84</span><span class="gmail-pln"> par    </span><span class="gmail-lit">0.927</span><span class="gmail-pln">s   </span><span class="gmail-lit">0.256</span><span class="gmail-pln">s     </span><span class="gmail-lit">0.0030</span><span class="gmail-pln">s    </span><span class="gmail-lit">0.0067</span><span class="gmail-pln">s

  Parallel GC work balance</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">67.93</span><span class="gmail-pun">%</span><span class="gmail-pln"> </span><span class="gmail-pun">(</span><span class="gmail-pln">serial </span><span class="gmail-lit">0</span><span class="gmail-pun">%,</span><span class="gmail-pln"> perfect </span><span class="gmail-lit">100</span><span class="gmail-pun">%)</span><span class="gmail-pln">

  TASKS</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">10</span><span class="gmail-pln"> </span><span class="gmail-pun">(</span><span class="gmail-lit">1</span><span class="gmail-pln"> bound</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">9</span><span class="gmail-pln"> peak workers </span><span class="gmail-pun">(</span><span class="gmail-lit">9</span><span class="gmail-pln"> total</span><span class="gmail-pun">),</span><span class="gmail-pln"> using </span><span class="gmail-pun">-</span><span class="gmail-pln">N4</span><span class="gmail-pun">)</span><span class="gmail-pln">

  SPARKS</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> </span><span class="gmail-pun">(</span><span class="gmail-lit">0</span><span class="gmail-pln"> converted</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> overflowed</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> dud</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> GC'd</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln"> fizzled</span><span class="gmail-pun">)</span><span class="gmail-pln">

  INIT    time    </span><span class="gmail-lit">0.001</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln">  </span><span class="gmail-lit">0.004</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  MUT     time   </span><span class="gmail-lit">12.636</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln"> </span><span class="gmail-lit">12.697</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  GC      time    </span><span class="gmail-lit">2.359</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln">  </span><span class="gmail-lit">0.650</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  EXIT    time   </span><span class="gmail-pun">-</span><span class="gmail-lit">0.015</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln">  </span><span class="gmail-lit">0.003</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">
  Total   time   </span><span class="gmail-lit">14.982</span><span class="gmail-pln">s  </span><span class="gmail-pun">(</span><span class="gmail-pln"> </span><span class="gmail-lit">13.354</span><span class="gmail-pln">s elapsed</span><span class="gmail-pun">)</span><span class="gmail-pln">

  Alloc rate    </span><span class="gmail-lit">2</span><span class="gmail-pun">,</span><span class="gmail-lit">878</span><span class="gmail-pun">,</span><span class="gmail-lit">972</span><span class="gmail-pun">,</span><span class="gmail-lit">840</span><span class="gmail-pln"> bytes per MUT second

  Productivity  </span><span class="gmail-lit">84.2</span><span class="gmail-pun">%</span><span class="gmail-pln"> </span><span class="gmail-kwd">of</span><span class="gmail-pln"> total user</span><span class="gmail-pun">,</span><span class="gmail-pln"> </span><span class="gmail-lit">95.1</span><span class="gmail-pun">%</span><span class="gmail-pln"> </span><span class="gmail-kwd">of</span><span class="gmail-pln"> total elapsed

gc_alloc_block_sync</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">9612</span><span class="gmail-pln">
whitehole_spin</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">0</span><span class="gmail-pln">
gen</span><span class="gmail-pun">[</span><span class="gmail-lit">0</span><span class="gmail-pun">].</span><span class="gmail-pln">sync</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">2044</span><span class="gmail-pln">
gen</span><span class="gmail-pun">[</span><span class="gmail-lit">1</span><span class="gmail-pun">].</span><span class="gmail-pln">sync</span><span class="gmail-pun">:</span><span class="gmail-pln"> </span><span class="gmail-lit">47704</span></code></pre></div><div><br></div><div>Thanks for reading this far. I now have three questions.</div><div><br></div>1. I understand that my program necessarily creates tons of garbage because it parses and then throws away 5 MB of JSON 500,000 times. However, I don't really understand why <code><span class="gmail-com">+RTS -A32m -n4m</span></code> helps, and I'm always reluctant to sprinkle in magic I don't fully understand. Can anyone help me understand what these flags do?<br><div><br></div><div>2. It seems that a larger allocation area is really something I should be using, but I can't figure out how to add these flags to my package.yaml alongside the other options. From the documentation for GHC 8.2, I thought it needed to look like the snippet below, but it never works: I usually get an error saying that -A32m and -n4m are unrecognized flags. How do I add them to my package.yaml so that I don't have to pass them every time I run the program?<br></div><div><br></div><div><span style="font-family:monospace,monospace">ghc-options:<br>    - -threaded<br>    - -rtsopts<br>    - "-with-rtsopts=-N4 <code><span class="gmail-com">-A32m -n4m"<br></span></code></span></div><div><br></div><div>3. Finally, the most important question I have is this. When I run this program on OSX, it runs successfully through to completion. However, <i>a few minutes after it terminates</i>, my terminal becomes unresponsive. I use Emacs as my editor, typically launched from a terminal window, and it becomes unresponsive too. This does not happen with other programs I write, and it happens <i>every time</i> I run this particular application, so I know that this application is to blame. 
<br></div><div><br></div><div>The crazy thing is that force-quitting the terminal or logging out doesn't help: I actually have to restart my computer to use the terminal application again. Other details that may help:<br></div><ul><li>The crash happens after the process for my program has terminated.</li><li>When I watch its progress in htop, the program never comes close to running out of memory: its memory usage hovers around the same value.</li></ul><div>I can't really deploy an application that has this kind of crashing problem, but I don't know how to debug this issue. My stab-in-the-dark guess is that the heap allocations are somehow not reclaimed even after the process has terminated. Can anyone offer suggestions on things to look for or ways to debug and/or fix this issue?</div><div><br></div></div><div>Finally, if anyone has suggestions on better ways to structure my application or parallelize the slow parts, I'll happily take those.</div><div><br></div><div>Thanks again for reading. I appreciate any suggestions you may have.</div><div><br></div><div>Best,<br></div><div><br>-- <br><div class="gmail_signature"><div dir="ltr">Erik Aker</div></div>
</div></div>