using posix_fadvise pre-cache frequency data
时间:2010-08-29 来源:osdba
数据库启动的时候,数据是冷却的,在执行SQL时需要到磁盘搜索BLOCK并载入到BUFFER,这个时候的SQL响应速度比命中情况下的响应速度一般要慢10倍以上,这种一般被称为未命中的查询。
如果数据库启动的时候刚好遇到SQL执行高峰,可能应为SQL响应速度过慢导致应用被堵塞死。造成长时间的应用堵塞,恶性循环。
那么怎么更好的解决这种类似的情况,今天刚好看到了PostgreSQL上的一个很好的例子。
pgfincore,这个项目是利用posix_fadvise来对PostgreSQL数据库对象文件进行缓存,放到OS级别的缓存中(注意不是数据库buffer)。所以数据库去文件系统取BLOCK的时候实际上是在操作系统的缓存就拿到的。比直接从磁盘取要快很多。
下面是posix_fadvise的简单描述:
The posix_fadvise() function shall advise the implementation on the expected behavior of the application with respect to the data in the file associated with the open file descriptor, fd, starting at offset and continuing for len bytes. The specified range need not currently exist in the file. If len is zero, all data following offset is specified. The implementation may use this information to optimize handling of the specified data. The posix_fadvise() function shall have no effect on the semantics of other operations on the specified data, although it may affect the performance of other operations.
The advice to be applied to the data is specified by the advice parameter and may be one of the following values:
POSIX_FADV_NORMAL
Specifies that the application has no advice to give on its behavior with respect to the specified data. It is the default characteristic if no advice is given for an open file.
POSIX_FADV_SEQUENTIAL
Specifies that the application expects to access the specified data sequentially from lower offsets to higher offsets.
POSIX_FADV_RANDOM
Specifies that the application expects to access the specified data in a random order.
POSIX_FADV_WILLNEED
Specifies that the application expects to access the specified data in the near future.
POSIX_FADV_DONTNEED
Specifies that the application expects that it will not access the specified data in the near future.
POSIX_FADV_NOREUSE
Specifies that the application expects to access the specified data once and then not reuse it thereafter.
RETURN VALUE
Upon successful completion, posix_fadvise() shall return zero; otherwise, an error number shall be returned to indicate the error.
ERRORS
The posix_fadvise() function shall fail if:
[EBADF]
The fd argument is not a valid file descriptor.
[EINVAL]
The value of advice is invalid.
[ESPIPE]
The fd argument is associated with a pipe or FIFO.
下面进入主题:
1. 安装pgfincore:
1.1 将下载好的pgfincore解压后,拷贝到postgresql的源代码目录/contrib/
chown -R postgres:postgres postgresql的源代码目录
su – postgres
cd postgresql的源代码目录/contrib/pgfincore
USE_PGXS=1 make clean
USE_PGXS=1 make
1.2 su – root
. /home/postgres/.bash_profile
cd postgresql的源代码目录/contrib/pgfincore
USE_PGXS=1 make install
1.3 su – postgres
cd postgresql的安装目录/share/contrib
psql mydb superuser -f ./pgfincore.sql
2. 使用场景:
2.1 将表或者索引的数据文件加载到OS缓存:
cedric=# select * from pgfadv_willneed(‘pgbench_accounts’);
relpath | block_size | block_disk | block_free
——————–+————+————+————
base/16384/24598 | 4096 | 262144 | 111882
base/16384/24598.1 | 4096 | 55318 | 56764
(2 rows)
Time: 39309,294 ms
relpath代表文件名。
block_size代表块大小。
block_disk代表数据对象大小,使用块为单位。
block_free代表OS cache还有多少,使用块为单位。
2.2 将表或索引的在内存中的状态保存到磁盘(这里保存的应该是类似地址的东西,会在数据文件目录创建带有_mincore后缀的文件).
当数据库刚启动的时候,查询会比较慢,应为PostgreSQL和OS级别的缓存都还没有缓存需要的BLOCK。
执行下面的命令来快照和还原一个OS CACHE。
– Snapshot
cedric=# select * from pgmincore_snapshot(‘pgbench_accounts’);
relpath | block_size | block_disk | block_mem | group_mem
——————–+————+————+———–+———–
base/16384/24598 | 4096 | 262144 | 131745 | 1
base/16384/24598.1 | 4096 | 55318 | 55318 | 1
– Restore
cedric=# select * from pgfadv_willneed_snapshot(‘pgbench_accounts’);
relpath | block_size | block_disk | block_free
——————–+————+————+————
base/16384/24598 | 4096 | 262144 | 105335
base/16384/24598.1 | 4096 | 55318 | 50217
(2 rows)
Time: 38745,140 ms
The column “block_mem” report how many blocks of the file are in memory. The column “group_mem” report that all the bloks in memory are contigous (only one group).
BLOCK_MEM:对应这行的文件有多少个块已经缓存在OS CACHE中了。
GREOUP_MEM:有多少组连续内存。
3. 函数解说:
3.1 pgsysconf:
输出当前操作系统的块大小,剩余多少可用的CACHE(块)。
cedric=# select * from pgsysconf();
block_size | block_free
————+————
4096 | 417534
3.2 pgmincore
输入表名,索引名或OID
输出:
relpath : the relation path
block_size : the size of one block disk
block_disk : the total number of file system blocks of the relation
block_mem : the total number of file system blocks of the relation in page cache. (not the shared buffers from PostgreSQL but the OS cache)
group_mem : the number of groups of adjacent block_mem
cedric=# select * from pgmincore(‘pgbench_accounts’);
relpath | block_size | block_disk | block_mem | group_mem
——————–+————+————+———–+———–
base/16384/16603 | 4096 | 262144 | 0 | 0
base/16384/16603.1 | 4096 | 65726 | 0 | 0
cedric=# select * from pgbench_accounts limit 10000;
cedric=# select * from pgmincore(‘pgbench_accounts’);
relpath | block_size | block_disk | block_mem | group_mem
——————–+————+————+———–+———–
base/16384/16603 | 4096 | 262144 | 414 | 1
base/16384/16603.1 | 4096 | 65726 | 0 | 0
3.3 pgmincore_snapshot
将当前的数据库对象在OS 缓存中的状态记录下来。一遍后续的预加载操作。
cedric=# select * from pgmincore_snapshot(‘pgbench_accounts’);
relpath | block_size | block_disk | block_mem | group_mem
—————————-+————+————+———–+———–
base/16385/49240_mincore | 4096 | 262144 | 0 | 0
base/16385/49240.1_mincore | 4096 | 262144 | 238180 | 2
base/16385/49240.2_mincore | 4096 | 262144 | 56478 | 2
base/16385/49240.3_mincore | 4096 | 46902 | 0 | 0
# ls -1 /var/lib/postgresql/8.4/main/base/16385/49240*
49240
49240.1
49240.1_mincore
49240.2
49240.2_mincore
49240.3
49240.3_mincore
49240_fsm
49240_mincore
49240_vm
3.4 pgfadv_dontneed
告知系统,被标识的表的数据文件,在系统需要申请CACHE时,具有更高的优先级被刷出缓存。
cedric=# select * from pgfadv_dontneed(‘pgbench_accounts’);
relpath | block_size | block_disk | block_free
——————–+————+————+————
base/16384/24598 | 4096 | 262144 | 178743
base/16384/24598.1 | 4096 | 55318 | 234078
3.5 pgfadv_willneed
告知系统尽可能多的加载数据库对象文件到OS缓存。
cedric=# select * from pgfadv_willneed(‘pgbench_accounts’);
relpath | block_size | block_disk | block_free
——————–+————+————+————
base/16384/16603 | 4096 | 262144 | 3744
base/16384/16603.1 | 4096 | 65726 | 4236
3.6 pgfadv_willneed_snapshot
告知系统调用pgmincore_snapshot产生的快照,加载当时缓存中的BLOCK到OS缓存。
cedric=# select * from pgfadv_willneed_snapshot(‘pgbench_accounts’);
relpath | block_size | block_disk | block_free
——————–+————+————+————
base/16384/24598 | 4096 | 262144 | 105335
base/16384/24598.1 | 4096 | 55318 | 50217
剩余的这三个和刚开始对于的行为习惯一致。不多解释了。
3.7 pgfadv_normal
3.8 pgfadv_random
3.9 pgfadv_sequential
下面引用一下作者的测试报告:
cedric=# show track_disk ;
track_disk
————
on
cedric=# SELECT relname,
heap_blks_hit,heap_blks_read,heap_blks_real_read/2 as heap_blks_real_read,
idx_blks_hit,idx_blks_read,idx_blks_real_read/2 as idx_blks_real_read
from pg_statio_user_tables where relname = ‘pgbench_accounts’;
-[ RECORD 1 ]——-+—————–
relname | pgbench_accounts
heap_blks_hit | 0
heap_blks_read | 0
heap_blks_real_read | 0
idx_blks_hit | 0
idx_blks_read | 0
idx_blks_real_read | 0
./pgbench -S -t 100 -c2
tps = 43.698998 (including connections establishing)
tps = 43.785598 (excluding connections establishing)
relname | pgbench_accounts
heap_blks_hit | 0
heap_blks_read | 200
heap_blks_real_read/2| 200
idx_blks_hit | 342
idx_blks_read | 262
idx_blks_real_read/2 | 181
–
– After some runs
–
tps = 99.057743 (including connections establishing)
tps = 99.105170 (excluding connections establishing)
relname | pgbench_accounts
heap_blks_hit | 217
heap_blks_read | 19610
heap_blks_real_read/2| 17663
idx_blks_hit | 41026
idx_blks_read | 18570
idx_blks_real_read/2 | 4631
–
– After some more runs
–
tps = 143.658449 (including connections establishing)
tps = 144.511803 (excluding connections establishing)
relname | pgbench_accounts
heap_blks_hit | 442
heap_blks_read | 39838
heap_blks_real_read | 31023
idx_blks_hit | 83635
idx_blks_read | 37372
idx_blks_real_read | 7112
cedric=# select * from pgmincore(‘pgbench_accounts’);
relpath | block_size | block_disk | block_mem | group_mem
——————–+————+————+———–+———–
base/16384/24598 | 4096 | 262144 | 126140 | 15422
base/16384/24598.1 | 4096 | 55318 | 26180 | 3228
(2 rows)
cedric=# select * from pgmincore(‘pgbench_accounts_pkey’);
relpath | block_size | block_disk | block_mem | group_mem
——————+————+————+———–+———–
base/16384/24603 | 4096 | 43892 | 43825 | 29
–
– snapshot pgbench_accounts and pgbench_accounts_pkey
– restart postgresql
– flush OS cache for pgbench_accounts and pgbench_accounts_pkey
–
cedric=# select * from pgmincore_snapshot(‘pgbench_accounts’);
cedric=# select * from pgmincore_snapshot(‘pgbench_accounts_pkey’);
cedric=# select * from pgfadv_dontneed(‘pgbench_accounts’);
cedric=# select * from pgfadv_dontneed(‘pgbench_accounts_pkey’);
cedric=# select pg_stat_reset();
– run a pgbench
./pgbench -S -t 100 -c2
tps = 38.497385 (including connections establishing)
tps = 38.569719 (excluding connections establishing)
– restore buffer cache
select * from pgfadv_willneed_snapshot(‘pgbench_accounts’);
select * from pgfadv_willneed_snapshot(‘pgbench_accounts_pkey’);
– run a pgbench
./pgbench -S -t 100 -c2
tps = 169.629961 (including connections establishing)
tps = 170.889926 (excluding connections establishing)