[check_postgres] Changes to last_autovacuum check

Sat Jan 22 23:51:25 UTC 2011

All,

I was looking at last_autovacuum (and last_autoanalyze) and trying to 
improve it; the current check seems to cause a lot of false positives by 
reporting tables which haven't been updated in quite some time.  Here's 
my first stab at a query for "table has not been autovacuumed"; rather 
than depending on time, it checks for tables whose ratio of dead rows is 
more than 2X what autovacuum is supposed to tolerate:

SELECT schemaname, relname,
     GREATEST(last_vacuum, last_autovacuum) as last_vacuumed
FROM pg_stat_user_tables,
      ( SELECT setting::numeric as scalefactor
          FROM pg_settings
          WHERE name = 'autovacuum_vacuum_scale_factor' )
          as scalefactor,
      ( SELECT setting::numeric as threshold
          FROM pg_settings
          WHERE name = 'autovacuum_vacuum_threshold' )
      as threshold
WHERE n_dead_tup > threshold
      and
      ( ( n_dead_tup - threshold )  /
         (CASE WHEN n_live_tup = 0
             THEN 1 ELSE n_live_tup END ))
         >  ( scalefactor * 2 );

This obviously needs work; for example, it doesn't take into account 
tables which might have custom autovacuum settings.  The query above is 
written with the idea of a time threshold being passed by nagios as a 
warning and critical parameter, but I'm wondering if that's the best 
way.  Maybe we should report the ratio of dead rows instead, and put 
warning/critical thresholds on that?

I also think the above approach could be more accurate than the old 
bloat query for detecting bloated tables in 8.4 and later.

If this approach seems good to people, I'll work up a WIP patch to try out.

--Josh Berkus
-- 
                                   -- Josh Berkus
                                      PostgreSQL Experts Inc.
                                      http://www.pgexperts.com