Can Sequel be configured to defer materialization?


#1

@jeremyevans this is happening because:

  1. @sgrif was not using the sequel-pg gem, and I think you have correctly opted to use your C implementation rather than the pg gem's coder implementation, which is slower. Everyone who uses Sequel really should use sequel-pg; I do not think this is cheating.

  2. Sequel does not defer materialization, so you pay the cast cost up front.

Is there any way to configure Sequel to defer materialization? As it stands, it can introduce an enormous cost for people who migrated to Sequel.

For context, this is the corrected bench:

Raw blows both Sequel and ActiveRecord out of the park, being 5x faster due to deferred materialization. But once you select all the columns, Sequel and raw are mighty close.

I know it is a battle, choosing deferred materialization just to work around users writing inefficient DB access code, but over-selection is an epidemic in the Rails world, and it is not going away.

@bmarkons: @sgrif pointed out that the benchmark RubyBench - Long Running Ruby Benchmark is incorrect because it inserts nils into the date columns; we have to fix that ASAP! (Also, the title on that page needs love.)


#2

No. Sequel does not defer typecasting. Typecasting happens at the dataset-retrieval level, not the model level.

What Sequel offers instead is the lazy_attributes plugin, which does not select the column during the query, but runs a new query on demand if the column value is actually needed (including handling of N+1 issues). This is superior in most cases, though it does require that the plugin be specifically used.

It is theoretically possible to delay typecasting in adapters where the adapter does the typecasting instead of the driver (postgres when sequel_pg is not used). You could remove all of the type conversion procs, and then have the model column getter methods use the same procs for the conversion. Sequel doesn’t have an implementation for that, but it’s probably not hard. However, it’s not something I would want to support.
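A hand-rolled sketch of that idea in plain Ruby (this is not Sequel's API; the converter table, class, and column names are all made up for illustration) would defer the conversion into the getter and cache the result:

```ruby
require "date"

# Hypothetical conversion procs keyed by column type. This mirrors the
# shape of adapter-level type conversion, but is purely illustrative.
CONVERTERS = {
  date:    ->(s) { Date.parse(s) },
  integer: ->(s) { Integer(s) }
}

# A row that keeps the driver's raw strings and typecasts lazily,
# caching each converted value on first access.
class DeferredRow
  def initialize(raw, types)
    @raw = raw      # column name => raw string from the driver
    @types = types  # column name => type symbol
    @cache = {}
  end

  def [](column)
    @cache[column] ||= CONVERTERS.fetch(@types[column]).call(@raw[column])
  end
end

row = DeferredRow.new(
  { created_on: "2017-08-25", age: "42" },
  { created_on: :date, age: :integer }
)
puts row[:age]  # 42 -- parsed only on first access
```

Columns that are never read are never parsed, which is exactly where the benchmark gap for raw access comes from.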

Basically, ActiveRecord is optimized for the case where the user is using inefficient queries, and Sequel is optimized for the case where the user is using efficient queries. Based on the average user of each, you could say that both are making the correct choice.


#3

Given that… we MUST ensure we are super clear with our benchmarks when we are over-selecting. I definitely want to keep benchmarks around for over-selection since it is a Rails epidemic.

I think every bench that “over-selects” should end with the text “over select” or something. @bmarkons

One issue with this line of thinking is that it makes it quite brutal to migrate from AR to Sequel, because you will not be properly winning until you name every column you are selecting. It also gets particularly brutal with complex joins involving 5 tables.

I get the complexity here, though. It is nothing that can easily be resolved.


#4

Also, another tiny note, @jeremyevans: how is sequel-pg beating the C implementation of the coders that the pg gem has?


#5

There are certainly cases where ActiveRecord can be faster than Sequel, and selecting columns you aren’t using is one of them. I think it’s valuable for people to know that, so I would certainly recommend publishing the benchmark that shows ActiveRecord is faster than Sequel in that area, especially along with a comparison in the same graph of only selecting columns being used and/or using all columns selected.

As a general rule, I think you should only optimize application code when you need to. If something is 3x slower and not a bottleneck, it's no big deal. If it becomes a bottleneck and you try to optimize, the fact that the optimized version is much more optimized leads to much better overall performance. It's certainly the case that converting from ActiveRecord to Sequel can make things that weren't bottlenecks before into bottlenecks, but addressing those specific issues will lead to much better performance. In other words, I prefer a low floor and a high ceiling to a high floor and a low ceiling.

As to why sequel_pg is faster than pg itself, I’m not sure. I wrote sequel_pg’s date parsing implementation before pg did typecasting at all, and I’ve never looked at pg’s implementation.


#6

@sam Are you sure ruby-pg decodes date/timestamp values in C? From a brief look at their repository, they appear to be using lib/pg/text_decoder.rb (in ged/ruby-pg on GitHub), which does everything in Ruby, using regexps to extract values (no wonder it is slower).

Sure enough:

$ irb -r pg
irb(main):001:0> conn = PG.connect(user: 'sequel_test')
=> #<PG::Connection:0x0011e94f96e1d8>
irb(main):002:0> conn.type_map_for_results = PG::BasicTypeMapForResults.new conn; nil
=> nil
irb(main):003:0> conn.exec('select current_date').values
=> [[#<Date: 2017-08-25 ((2457991j,0s,0n),+0s,2299161j)>]]
irb(main):004:0> PG::TextDecoder::Date.instance_method(:decode).source_location
=> ["/usr/local/lib/ruby/gems/2.4/gems/pg-0.20.0/lib/pg/text_decoder.rb", 11]
irb(main):005:0> PG::TextDecoder::Date.send(:remove_method, :decode)
=> PG::TextDecoder::Date
irb(main):006:0> conn.exec('select current_date').values
RuntimeError: no decoder function defined
        from (irb):6:in `decode'
        from (irb):6:in `values'
        from (irb):6
        from /usr/local/bin/irb:11:in `<main>'

#7

Ouch!!! That is terrible. Now all the numbers make absolute sense; thanks so much for digging into this!


#8

Thanks for pointing that out! I'll fix this benchmark.

@sam, when you refer to over-selecting, do you mean selecting columns you don't actually use?
I will address these benchmarks and suffix them with "over_select".


#9

Yes, awesome about adding the suffix.

Also, regarding the raw benchmark itself:

You are both over-selecting here and not doing it in the most efficient way.

I would recommend:

  1. Adding a couple more columns (age: int, approved: bool); both bool and int can be incredibly inefficient without a type mapper.

  2. Moving to a pattern more similar to:


# in warmup
@connection = ActiveRecord::Base.connection.raw_connection
@mapper = PG::BasicTypeMapForResults.new(@connection)

# iteration
results = @connection.exec("select * from users")
results.type_map = @mapper

str = String.new
i = 0

while i < results.ntuples
  str << "name: #{results.getvalue(i, 1)} email: #{results.getvalue(i, 2)}\n"
  i += 1
end

This is the most efficient way of walking the result set. Also, be sure to create 2 raw benchmarks, one with over_select and one without.


#10

:+1: Will do it.

I have created separate setup scripts so the same setup is shared between Sequel and AR.
https://github.com/ruby-bench/ruby-bench-suite/pull/104