Project PostgreSQL Hyper-record

sam · June 11, 2018, 7:52am

Following: An analysis of memory bloat in Active Record 5.2 Discourse is looking to sponsor some work to heavily improve AR performance with Postgres in Rails 6.0.

The concept here is to add a series of “back-portable” patches to Rails 5.2 that improve performance around all sorts of areas of pain. Discourse will run all these patches in production far in advance of the Rails 6.0 release however the plan is to make sure we land all the patches for Rails 6.0.

Here is a breakdown of changes we would like to investigate from “easy” to “hard”

Native date/time casting

PG library recently added native C level date/time casting, this patch would amend AR internal type mapping code to make use of this and ensure it is default for 6.0

It would also be shipped as a possible monkey patch / extension for 5.2 that early adopters.

The patch will include benches that demonstrate what a big difference this makes (selection / plucking of date from a table)

There are very little risks here, just need to make sure that we consult with pixeltrix (Andrew White) · GitHub to ensure there is proper full parity with the PG code.

GOAL

create table things (
   created_at timestamp without timezone
)

1000.times do { Thing.create(created_at: Time.now)}

# This is significantly faster with this patch in place (expecting 2x faster) 
Thing.pluck(:created_at)

PG specific ActiveRecord::Result

At the moment all AR Postgres results are “adapted” via a generic ActiveRecord::Result object.

This object is not PG aware and materializes all data prior to passing it along. This causes reasonable levels of internal waste especially slowing stuff like pluck down a lot.

By writing a specific implementation here we pave the way to other internal improvements.

gist.github.com

https://gist.github.com/SamSaffron/f32dbd74b85e43f9a9d8f93f2c6c5e60

pg_result.rb

# frozen_string_literal: true
# tiny rerunner (I use this on local to rerun script)
require_relative '../rerunner/rerunner.rb'

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pg'
  gem 'activesupport'#, path: '../../rails/activesupport'

This file has been truncated. show original

We can see that a careful “rewrite” here can provide us with 2x performance for pluck.

There are some missing bits in the PG gem that will help out here:

Ability to get entire row as “values” (instead of N calls to getvalue)
Ability to defer cast a row and get a hash like objects that is already materialized

Another challenge is that the interface for ActiveRecord result is rather wide and “special” for example .empty? will call #rows which is somewhat odd, why do we even need this?

Overall this patch is straight forward and of minimal internal changes with enormous benefit to “pluck” and some improvement overall depending on support the PG gem gives us.

Special case AttributeSet and LazyAttributeHash for PG

This is a huge performance win:

class ActiveModel::AttributeSet::Builder
  def build_from_database(values = {}, additional_types = {})
    FakeSet.new values
  end

  switch_build_from_database :new

  class FakeSet
    def initialize(values)
      @values = values
    end

    def fetch_value(key)
      if block_given?
        @values.fetch(key.to_s) { yield }
      else
        @values[key.to_s]
      end
    end
  end
end

this trivial monkey patch can give us a very big perf bump, looking at about 15-20% percent improvement for Topic.select(:id, :title).limit(10)

It does so by reducing internal object allocations that are very expensive.

The tricky part is allowing for all the internal AR edge cases with magic custom attributes on a column and default values without compromising internal designs.

Opt-in lazy materialization

If we can couple PG::Result to models we can be 100% delayed on materialization of columns and type casting, this means that if you over select we only do very minimal amount of work.

The tricky thing is that for pathological cases this can lead to memory bloat:

$long_term = []
Topic.limit(1000).each do |t|
   $long_term << t if t.id == 100 # only one topic is retained, but entire PG::Result will remain in memory
end

Though this is probably an edge case for 99.9% of apps out there, AR can not ship with an unsafe default so we would have to have this mode be an “opt-in” thing.

Lazy casting out-of-the-box (for people who do not opt in to lazy materialization)

The last piece that requires changes from the PG gem is lazy casting for all, which would mean an API like this would be exposed.

result.tuple_values(7)
# [1,2, 'abc']

result.tuple_values(7, :lazy_cast) # or result.lazy_tuples_values(7) 
# same array like object except that it is "lazy" in that [1] will be the first time [1] is cast via type mapper
# Object will hold a single cstring with all the data in the raw form copied from the pg_result.

sam · June 11, 2018, 9:59pm

Lars just sent through a PR for tuple values

sam · June 12, 2018, 11:29am

Lars,

One point Matthew raised to me that is very much worth noting is: PostgreSQL: Custom allocators in libpq

libpq is pretty bad with the way it plays with Ruby malloc accounting, so you can have a whole bunch of memory allocated by it but not be aware of it. In turn this can lead to pretty serious memory bloat if we do not clear early.

Something like:

result.tuple_values(7, :lazy_cast)

and

result.values(:lazy)

can work around this issue by making proper malloced copies which Ruby can properly account for.

A second option is having the PG gem have some special interface into Ruby where it informs it how many bytes are allocated to it, that seems like a more complex problem to solve.

larskanis · June 12, 2018, 3:35pm

A second option is having the PG gem have some special interface into Ruby where it informs it how many bytes are allocated to it

The PG gem could use RTypedData which has a callback for object size calculation. But this value is only used for ObjectSpace.memsize_of(), not for triggering the GC (because it would count memory twice when it’s allocated by xmalloc). Also libpq doesn’t offer an interface to retrieve the allocated result size. I already did some investigation by accessing the internal PGresult structure, but it turned out that there’s no reliable way to reassemble the malloc’d memory size.

Tenderloves proposal would solve this issue, but it seams to be discontinued.

In turn this can lead to pretty serious memory bloat if we do not clear early.

Did you do any tests which resulted in memory bloat? I changed a spree shop in production to not clear any PG::Result objects. It didn’t take measurable more memory after one hour. I guess rails allocates far more memory than libpq, so that the GC is called anyway and the PG::Result objects are regular freed, in a common rails application.

Nevertheless adding lazy versions to array returning PG::Result methods should be possible. How lazy should PG::Result#values be, since it returns Array<Array<String>> ?

sam · June 13, 2018, 6:51am

No worries, I added a bit of background here, with a test case that explodes together with a bunch of ideas:

It really heavily depends on use cases here, we have non-cleared PG::Results floating around in Discourse and usually we are OK, but sometimes in a few out of hundreds of sites we can see unicorn bloat to 1-1.5gb RSS, which is highly likely due to the anti-pattern. The big problem is if these objects sneak in to the old generation which tends to happen with heavy multithreaded use like sidekiq or a busy puma.

I would say extremely lazy.

So, initially you would just copy the memory of the result into a char*[cols*rows] and track column count and row count ints and initialize an RVALUE*[cols*rows] to all nulls

As you “materialize” stuff you would free a char* and update the RVALUE*[cols*rows] in the wrapped type and be careful to rb_gc_mark so Ruby does not remove them under you.

When looking up a row/column you would first check the RVALUE array if there is a NULL there you would upgrade the char* to an RVALUE when needed and delete strings returning the RVALUE.

If you “peel off a row”, you would have a very similar structure per/row which would be initialized with char*[cols] and RVALUE*[cols].

So the flow would be

result = c.exec('select * from table')
values = result.lazy_values
# 1 RVALUE is created for wrapping object

row = values[1] 
# 1 RVALUE created for row wrapping object (reusable)

data = row[17]
# 1 new RVALUE is created

# if we wish to avoid wrapping row objects we can

col_vals = values.column_values(0) # returns a single array

larskanis · June 14, 2018, 7:23pm

OK, I did a serious look at the PG gem about how detached lazy rows could be implemented.

A primary issue is, that the current typemap implementation doesn’t fit to the lazy paradigm. If a PG typemap gets a request to cast a field, it reads the value from a libpq result object, picks the previously defined decoder and calls the decoder with the field pointer as a parameter. The resulting object is returned through the typemap to the caller.

But a detached row shall be decoupled from the result object. So we have to split typemaps and decoders apart, so that the typemaps run as part of tuple_values to fetch the values from the result and to determine the decoder, without executing it, but storing all these data into the PG::LazyTuple internal memory. When the tuple value is materialized later on, only the decoder is executed.

Splitting typemaps from decoders needs some refactorization in PG. I have to take special care not to slow down the well known immediate value retrieval methods.

We can avoid copying BLOBs twice, by using a threshold for lazy casts. Fields with a length bigger than some thousands bytes, could be directly casted as a T_STRING and stored as materialized field from the start. In this case the costs for creation of a ruby string is insignificant compared to copying the field value, but copying twice when the field is accessed costs CPU and wastes memory.

A comparably small but valuable extension is, to add another class: something like PG::VeryLazyTuple. This would keep a reference to the PG::Result object, a tuple number and materialized field values. In this case we can execute both typemap and decoder in a lazy manner. We could use these both API compatible object types to easy compare copying and non-copying lazy implementations. And we can easily switch from one to another implementation, if the GC/memsize issue is solved.

sam · June 15, 2018, 12:42am

I really like all of this! don’t have much to add except that I really like it a lot.

larskanis · June 29, 2018, 6:12pm

I added a first version of lazy tuple_values. It is not yet the copying version, but it keeps a reference to the PG::Result object. The interface of both implementations however should be equal.

I’m not sure about the best API to PG::Result. tuple_values_very_lazy is just to avoid additional code changes for now.

sam · June 30, 2018, 10:55am

Awesome!!!

I think API wise maybe just keeping the behavior on PG::Result is ideal


result.detach_tuple_values = true

result.lazy_cast = true

result.tuple_values(5)

OR


result.tuple_values_options(:detach, :lazy_cast)

result.tuple_values(3)

I don’t think one would want to mix and match here so just setting the behaviour at the result level seems like the cleanest way to avoid needing too many API methods.

larskanis · July 2, 2018, 3:38pm

So, I played a bit with ActiveRecord and the PG specific Result object on this branch:

It passes AR’s rake test:postgresql in both modes (with and w/o PG specific Result). Benchmark is here: pg_result_lars.rb · GitHub

Some changes to AR are to avoid wasteful computations like map(&:first), so that only really necessary information are retrieved. This gives a little speedup for database agnostic Result as well.

It also adds a PG specific Result, which gives some noticeable speedup for wasteful queries. What I dislike is, that it adds even more complexity to the process. And AR instantiation is already pretty complex.

A big bottleneck when retrieving many values is that every single value currently goes through type.deserialize(value). Removing this could give a 2x speedup for pluck, as you showed here: pg_result.rb · GitHub . So an idea is to move all deserialization and default value assignment behind PG’s typemaps. This would obsolete LazyAttributeHash since PG::LazyTuple is already lazy. And we could remove additional value deserialization and serialization, since all values come out of the Result as properly serialized value already or will get serialized through PG internally (for query parameters).

sam · July 2, 2018, 11:32pm

I personally love the idea of only doing type mapping in the type mapper and avoiding the “double type mapping” pattern Rails now has.

I think we should push the PR forward as a “step 1” kind of thing, even though we need more steps longer term.

My only bit of feedback though is that the very_lazy_tuples kind of syntax is somewhat confusing, and the more I think about it the more I like result.tuple_values_options(:detach, :lazy_cast)

Topic		Replies	Views
Can we fill up our benchmarks?	12	2344	September 7, 2017
Discourse Releases benchmarks bumped to Discourse stable 1.2.4	6	2777	April 23, 2017
How can active record be faster?	14	2289	July 24, 2017
Rails Micro Benchmarks!	11	3538	June 2, 2015
Discourse Categories Benchmark	0	1843	February 16, 2015