MySQL中字符串查询效率大比拼

2023-06-01 14:15 由程序员Forlan 发表于 #数据库

背景

最近有个同事对字符串加索引，加完后，发现多了个奇奇怪怪的数字
执行的SQL如下：

alter table string_index_test add index `idx_name` (`name`) USING BTREE;

这个奇怪数字就是191，它很是疑惑，也没指定索引的长度

通过查看MySQL官方文档

InnoDB has a maximum index length of 767 bytes for tables that use COMPACT or REDUNDANT row format, so for utf8mb3 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8mb3 columns with indexes longer than 191 characters, you must index a smaller number of characters.

In an InnoDB table that uses COMPACT or REDUNDANT row format, these column and index definitions are legal:
col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))

To use utf8mb4 instead, the index must be smaller:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))

大概意思就是InnoDB最大索引长度为 767 字节数，用的编码是utf8mb4，则可以存储191个字符（767/4 约等于 191），编码字段长度超出最大索引长度后MySQL 默认在普通索引追加了191

思考

1、MySQL中如何提高字符串查询效率？

对字符串加索引？
一般情况下，是不建议在字符串加索引，占空间
如果一定要加，建议可以指定长度，前提是字符串前面部分区分度好的话，此时这类索引就叫前缀索引

2、前缀索引有什么问题？

区分度不好的话，很容易发生碰撞，进而引发一系列问题
我们再通过执行计划来分析一波

上面分别演示了前缀索引和普通索引在只有where条件、order by和group by不同执行情况，可以看到Extra的说明，前缀索引只有where条件，无法使用覆盖索引，order by会使用filesort，group by会使用temporary和filesort
总的来说，前缀索引无法使用覆盖索引，进而导致order by和group by要使用文件排序，甚至临时表
前缀索引有这么些问题，不指定长度？怎么处理？

分析

准备了单表100W的数据进行测试
使用性能压力测试工具mysqlslap
性能测试脚本

mysqlslap -uroot -p --concurrency=100,200 --iterations=1 --number-of-queries=1 --create-schema=test --query=C:\xxx\query.sql

–concurrency=100,200 测试并发的线程数/客户端数，第一次100，第二次200
–iterations=1 指定测试重复次数1次
–number-of-queries=1 指定每个线程执行的 SQL 语句数量上限（不精确）
–create-schema=test 指定查询的数据库test

1、不加索引
查询的SQL：SELECT SQL_NO_CACHE * FROM string_index_test WHERE name=‘forlan’;

Benchmark
        Average number of seconds to run all queries: 8.328 seconds
        Minimum number of seconds to run all queries: 8.328 seconds
        Maximum number of seconds to run all queries: 8.328 seconds
        Number of clients running queries: 100
        Average number of queries per client: 0

Benchmark
        Average number of seconds to run all queries: 18.078 seconds
        Minimum number of seconds to run all queries: 18.078 seconds
        Maximum number of seconds to run all queries: 18.078 seconds
        Number of clients running queries: 200
        Average number of queries per client: 0

2、加字符串索引
alter table string_index_test add index idx_name (name) USING BTREE;
查询的SQL：SELECT SQL_NO_CACHE * FROM string_index_test WHERE name=‘forlan’;

Benchmark
        Average number of seconds to run all queries: 0.250 seconds
        Minimum number of seconds to run all queries: 0.250 seconds
        Maximum number of seconds to run all queries: 0.250 seconds
        Number of clients running queries: 100
        Average number of queries per client: 0

Benchmark
        Average number of seconds to run all queries: 1.438 seconds
        Minimum number of seconds to run all queries: 1.438 seconds
        Maximum number of seconds to run all queries: 1.438 seconds
        Number of clients running queries: 200
        Average number of queries per client: 0

3、使用CRC32创建索引

CRC全称为Cyclic Redundancy Check，又叫循环冗余校验。
CRC32是CRC算法的一种，返回值的范围0~2^32-1，使用bigint存储

加一个name_crc32列，创建这个列的所有，索引空间小很多，利用整型加速查询
加索引：alter table string_index_test add index idx_nam_crc32 (name_crc32) USING BTREE;
查询的SQL：SELECT SQL_NO_CACHE * FROM string_index_test WHERE name_crc32=CRC32(‘forlan’) and name=‘forlan’;
因为CRC32存在发生碰撞，所以加上name条件，才能筛选出正确的数据

Benchmark
        Average number of seconds to run all queries: 0.266 seconds
        Minimum number of seconds to run all queries: 0.266 seconds
        Maximum number of seconds to run all queries: 0.266 seconds
        Number of clients running queries: 100
        Average number of queries per client: 0

Benchmark
        Average number of seconds to run all queries: 0.390 seconds
        Minimum number of seconds to run all queries: 0.390 seconds
        Maximum number of seconds to run all queries: 0.390 seconds
        Number of clients running queries: 200
        Average number of queries per client: 0

总结

通过对字符串加索引，可以提高查询效率，但需要注意指定长度，无法使用覆盖索引
通过使用CRC32，需要额外存一个字段，将字符串转为整数存储，节省空间，效率提升并不是很大，但存在碰撞问题，可以加多字符串筛选条件
-对于CRC32存在碰撞问题，可以使用CRC64减少碰撞，但需要安装 common_schema database函数库