MySQL and Friendly URLs

JerseyFoo

1/g = g-1
Reaction score
40
I've been considering how best to optimize friendly URLs, i.e. something like "/mysql-and-friendly-urls/" instead of "/forums/showthread.php?p=949152". This system is an alternative to a simple query matching the full relative URL.

DB Structure
TABLE: words
w char(9) - the word, limited to 9 characters, primary index
k smallInt - the 'key', an integer used to reference the word
c tinyInt - count, how many resources this word points to

ex.
w | k | c
'this' | 115 | 112
'is' | 28 | 496
'a' | 31 | 337
'title' | 5049 | 8

TABLE: switch
i mediumInt - an id, primary index, no use other than being required by MySQL
k smallInt - a key from the words table, normal index
r mediumInt - resource ID

i | k | r
1029 | 115 | 6089
1030 | 115 | 4356
1031 | 115 | 87652

TABLE: rels
r mediumInt - resource ID
... - data used to reference to where and how the content is stored

---------

Handling
- URL is forwarded to PHP file
- Stripped of unsafe characters
- Exploded by "-"
- Words longer than 9 characters are truncated ( probably by MySQL via the char(9) column )
- Query applied*
- If multiple resources are found, they are returned in list form with an additional "+IDOFRESOURCE" at the end of the URI.
- Otherwise, data delivered
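The handling steps above can be sketched roughly like this (a hypothetical helper; `url_to_words` and the constant are made up, assuming the 9-character limit of the char(9) column):

```python
import re

MAX_WORD_LEN = 9  # matches the char(9) column in the words table

def url_to_words(uri):
    """Turn a friendly URL like '/mysql-and-friendly-urls/' into the
    list of (truncated) words to look up in the words table."""
    # Strip unsafe characters: keep only letters, digits and dashes
    safe = re.sub(r'[^a-z0-9-]', '', uri.strip('/').lower())
    # Explode by "-" and truncate each word to 9 characters,
    # mirroring what inserting into char(9) would do
    return [w[:MAX_WORD_LEN] for w in safe.split('-') if w]

print(url_to_words('/mysql-and-friendly-urls/'))  # → ['mysql', 'and', 'friendly', 'urls']
```

Truncating in PHP rather than letting MySQL do it keeps the IN () list consistent with what is actually stored.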


*The Query
Code:
select * 
from resources l
	where exists (
		select 1
		from switch s
		where 
			s.r = l.r
			and s.k in ('.implode(',',$keys).')
		)
The query needs work (I've added the missing s.r = l.r correlation, and as written it still matches any word rather than all of them). I'd like to expand it to grab the $keys as it goes, and have a limit based upon the lowest count found, but I'm not quite sure how to do this in MySQL effectively.

Any input would be appreciated.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
What's this supposed to do besides kill the database? :rolleyes:

This query is going to be run on every page request, so you'd better come up with something fast, not a fancy inverted-index-like scheme. Use a table where the first column contains the relevant part of the URI and the subsequent ones hold the information needed to map it to the real query string.
 

JerseyFoo

1/g = g-1
Reaction score
40
What's faster in your opinion...

Scan potentially 50,000 tinytexts for an exact match

- OR -

Convert 5 or so char(9) strings to IDs
Find a common match ( more complicated of course )

I've read that MySQL's main bottleneck is the HD. One option, with a large amount of data, could save 90% of the space and remain entirely static ( how large are our vocabularies anyway? ).
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
What's faster in your opinion...

Scan potentially 50,000 tinytexts for an exact match

- OR -

Convert 5 or so char(9) strings to IDs
Find a common match ( more complicated of course )
The first one, of course. But I'd rather use a varchar with a decent index. How can you imagine that finding the 5 parts in a large dataset would be faster than hashing and locating a single line of text? At that point you only have your 5 chunks and nothing else; you still have to merge them.


I've read that MySQL's main bottleneck is the HD. One option, with a large amount of data, could save 90% of the space and remain entirely static ( how large are our vocabularies anyway? ).
MySQL has an awful query optimizer and slows to a crawl once it has to rely on temporary tables, which are quite often used without any reason.
 

JerseyFoo

1/g = g-1
Reaction score
40
I'm still very curious and would like to measure the difference, and also this can still be used for a tag or search query. Would you have any suggestions on putting the query together?

Also, of these structures which would be better...

key | id | resource
1 | 146 | 129
2 | 146 | 628
3 | 146 | 300...

( The primary key is useless aside from MySQL needing it )

- OR -

id | resource
14600 | 129
14601 | 628
14602 | 300...

( A certain amount of space is reserved for each )
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
I'm still very curious and would like to measure the difference, and also this can still be used for a tag or search query. Would you have any suggestions on putting the query together?
I wouldn't waste my time on such a thing. But, as you're curious, I'd suggest you figure that out for yourself. At least, you'll learn something from it.

Also, of these structures which would be better...
How would I know? You're pasting different data models containing only random integers without any kind of explanation. :rolleyes:
 

JerseyFoo

1/g = g-1
Reaction score
40
Relating object IDs to other object IDs; it was related to the above. One key of a word may point to many different resources. But this requires the word key to be used multiple times, so it cannot be a primary index.
 

Samuraid

Advisor
Reaction score
81
So you are saying this might occur?

1234 | 5678
1234 | 5678

If so, you cannot make those attributes into a primary key because two of the rows are duplicates of each other.

But if this occurred (with no duplicates):

1234 | 5678
1234 | 5679

Then you absolutely could make it a primary key, and if those were the only 2 attributes in the table, then it would be recommended to do so in most cases.
 

JerseyFoo

1/g = g-1
Reaction score
40
1234 | 5678
1234 | 5678 ...will not occur ( pointless to relate twice )

1234 | 5678
1234 | 5780
1235 | 5678 ...may certainly occur though

I'm just relating one object ID to another. And as far as I know, the common way to do this is

1 | 1234 | 5678
2 | 1234 | 5780

That is, an excess primary key, plus a regular index on the word key. The third column, the resource ID, will never appear in a WHERE clause, so it need not be indexed.

I was asking if it's better ( unless you know another way ) to reserve space for each key.

123400 | 5678
123401 | 5780

To query [WHERE primary key >= key*100 LIMIT (objects counted)] instead of [WHERE non-unique index key = key]. This would only work if no object had more than 100 pointers, which can be worked around; I'm just curious as to how MySQL would handle it.
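The reserved-space scheme can be sketched with SQLite standing in for MySQL (table and column names are made up for illustration; SLOTS plays the role of the 100 IDs reserved per word key):

```python
import sqlite3

SLOTS = 100  # space reserved per word key, as described above

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE switch (id INTEGER PRIMARY KEY, resource INTEGER)')

# Word key 1234 owns the id range [123400, 123499]; store three resources.
for n, resource in enumerate([5678, 5780, 5999]):
    conn.execute('INSERT INTO switch VALUES (?, ?)', (1234 * SLOTS + n, resource))

# Range scan on the primary key instead of an equality match
# on a separate non-unique index
key = 1234
rows = conn.execute(
    'SELECT resource FROM switch WHERE id >= ? AND id < ?',
    (key * SLOTS, (key + 1) * SLOTS)
).fetchall()
print([r[0] for r in rows])  # → [5678, 5780, 5999]
```

The trade-off is the hard cap of SLOTS pointers per key, which the post acknowledges would need a workaround.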

But, as you're curious, I'd suggest you figure that out for yourself. At least, you'll learn something from it.
I can't pull knowledge out of my backside; I'd have to find it somewhere, and that's why I'm here. I don't intend to copy and paste an exact query that does what I want, just an example to show me the way. And I do know of ways to do this, however I'm looking for a way to do it all in one query.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
The system is deeply flawed and requires a far better data structure to work reliably. Imagine the following permalink:

the-cheaper-the-better

Since one word occurs twice, you not only have to allow duplicates in the data model, you must also implement ordering, otherwise you can't tell the entries "the-better-the-cheaper" and "the-cheaper-the-better" apart.

This sums up to having at least 3 fields:
1) a relation to the "word" table (i.e. foreign key)
2) a permalink identifier to group the different words together (i.e. where clause)
3) something defining the order inside a group (order clause)

And now imagine this data model competing against a simple indexed table.
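Those three fields could look like this - a sketch with made-up table and column names, using SQLite in place of MySQL - which is enough to tell the two permalinks apart:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
-- one row per word occurrence; "position" preserves the order
-- inside a permalink, so duplicate words are allowed
CREATE TABLE permalink_words (
    permalink_id INTEGER,   -- groups the words together (where clause)
    position     INTEGER,   -- order inside the group (order clause)
    word_id      INTEGER REFERENCES words(id),  -- foreign key to words
    PRIMARY KEY (permalink_id, position)
);
''')

for i, w in enumerate(['the', 'cheaper', 'better'], start=1):
    conn.execute('INSERT INTO words VALUES (?, ?)', (i, w))

# permalink 1: the-cheaper-the-better; permalink 2: the-better-the-cheaper
for pid, word_ids in [(1, [1, 2, 1, 3]), (2, [1, 3, 1, 2])]:
    for pos, wid in enumerate(word_ids):
        conn.execute('INSERT INTO permalink_words VALUES (?, ?, ?)',
                     (pid, pos, wid))

def permalink(pid):
    rows = conn.execute('''
        SELECT w.word FROM permalink_words p
        JOIN words w ON w.id = p.word_id
        WHERE p.permalink_id = ? ORDER BY p.position''', (pid,)).fetchall()
    return '-'.join(r[0] for r in rows)

print(permalink(1), permalink(2))
# → the-cheaper-the-better the-better-the-cheaper
```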
 

JerseyFoo

1/g = g-1
Reaction score
40
Yes, that is to be ignored.

"word1-word2" and "word2-word1" would point to the same resource. This could also be considered a benefit for certain resources, since words are commonly mixed up or misspelt.

If there are separate resources with the respective names, multiple resources would be brought up, which would require the form "word1-word2+aZqE", where what follows the + is a string ID. Or alternatively, the original (oldest) resource would be fetched.

Also, in support of this, thanks to its much smaller static data size, it may be cached easily in memory and should be more stable.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
Yes, that is to be ignored.
Interesting simplification. :rolleyes:

What about partial matches?

a-simple-implementation
a-simple-yet-inefficient-implementation
not-a-simple-implementation
a-simple-implementation-but-completely-broken

How are you going to cover those, if my friendly URL is "a-simple-implementation" ?


If there are separate resources with the respective names, multiple resources would be brought up, which would require the form "word1-word2+aZqE", where what follows the + is a string ID.
So, you're generating a unique ID? Why not restrict the "friendly" URL use only that ID?
 

JerseyFoo

1/g = g-1
Reaction score
40
Interesting simplification.
One of the best ways to optimize, isn't it?

What about partial matches?

a-simple-implementation
a-simple-yet-inefficient-implementation
not-a-simple-implementation
a-simple-implementation-but-completely-broken

How are you going to cover those, if my friendly URL is "a-simple-implementation" ?
All 4 would be returned in a list format, as if you were using a search. WordPress uses this method, for example. ( not that I have any respect for WP, that server killer )

It's really not that hard, as the system I'm implementing this in has its own templates, data setups, etc. ( yes, it is very fast ). A multiple-resource return would actually be faster, as it would not have to load the data for any resource, just the search template.

So, you're generating a unique ID? Why not restrict the "friendly" URL use only that ID?
The ID is to be used only where needed. When a new resource is created, or edited in which the URL is affected, it would check the table for any conflicts, and then generate a unique ID for both the new and the conflicting resource.

------
Really (believe it or not), I didn't make up this idea in an instant. I've been considering it for a month now and have read up a bit on MySQL optimization - http://dev.mysql.com/doc/refman/5.0/en/optimizing-database-structure.html - the only 'rule' found there that it breaks is using multiple tables.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
One of the best ways to optimize isn't it.
Yes, getting rid of all the annoying and complicated edge cases is always the best solution. :rolleyes:

I know this first-hand from various enterprise products, which tend to be limited to simple use cases because insufficient analysis and bad implementation render the full feature set impossible to use without making the whole application unresponsive.

All 4 would be returned, in a list format as if you were searching, as if you were using a search. WordPress uses this method for example. ( not that I have any respect for WP, that server killer )
Essentially, you end up with nothing other than a search function. But even if I'm repeating myself, I'd like to remind you that you're reinventing an inverted index inside a general-purpose database. If you're down to such a system, why don't you switch to a search engine? It does exactly the same thing, only many times faster. Didn't you ask for an optimized version in the first place?


The ID is to be used only where needed. When a new resource is created, or edited in which the URL is affected, it would check the table for any conflicts, and then generate a unique ID for both the new and the conflicting resource.
This sounds pretty complex to me. Why not always add that unique ID and ignore the friendly stuff altogether? That's exactly what various vBulletin plugins do, and major websites like the German Heise:

Whether you request:
[noparse]http://www.heise.de/newsticker/Microsoft-droht-neues-EU-Bussgeld-Bruessel-eroeffnet-Verfahren-Update--/meldung/121865[/noparse]

or:
[noparse]http://www.heise.de/newsticker/meldung/121865[/noparse]

you end up on the same page.

But of course, that'd be by far too simple and efficient to fit your needs.

EDIT:
the only 'rule' it breaks found there is using multiple tables.
And you didn't realize that this "little" detail essentially kills the whole performance because of the join overhead?
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
Let's do a little experiment, I'm creating the following schema:

1) Store the "friendly stuff" as text inside a single table
Code:
mysql> desc test_links;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(32)      | NO   | PRI | 0       |       |
| text  | varchar(255) | YES  | UNI | NULL    |       |
+-------+--------------+------+-----+---------+-------+

mysql> select count(1) from test_links;
+----------+
| count(1) |
+----------+
|   200000 |
+----------+


2) Store the same data as pseudo-inverted-index inside two tables:

First, the words:
Code:
mysql> desc test_words;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(32)      | NO   | PRI | NULL    |       |
| word  | varchar(255) | NO   | UNI | NULL    |       |
+-------+--------------+------+-----+---------+-------+

mysql> select count(1) from test_words;
+----------+
| count(1) |
+----------+
|     9993 |
+----------+

Then the join table:
Code:
mysql> desc test_entries;
+---------+---------+------+-----+---------+-------+
| Field   | Type    | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| id      | int(32) | NO   | PRI | NULL    |       |
| word_id | int(32) | NO   | PRI | NULL    |       |
+---------+---------+------+-----+---------+-------+

mysql> select count(1) from test_entries;
+----------+
| count(1) |
+----------+
|  1299652 |
+----------+

mysql> select count(distinct id) from test_entries;
+--------------------+
| count(distinct id) |
+--------------------+
|             200000 |
+--------------------+


If you're interested in the sample set, it's taken from Wiktionary; I merged together the 10,000 most frequent English words and generated random entries. For statistics' sake, we've got a pretty decent average word count per entry:
Code:
mysql> select avg(c) from ( select count(1) as c from test_entries group by id)  s;
+--------+
| avg(c) |
+--------+
| 6.4983 |
+--------+


Now, let's run some tests:
First, using the single table:
Code:
mysql> select * from test_links where text = 'says-jill-toby-possibility-fabulous-dropping-wave-sis-repeating';
+--------+-----------------------------------------------------------------+
| id     | text                                                            |
+--------+-----------------------------------------------------------------+
| 183527 | says-jill-toby-possibility-fabulous-dropping-wave-sis-repeating |
+--------+-----------------------------------------------------------------+
1 row in set (0.00 sec)

Then, using the inverted index:
Code:
mysql> select e.id
from test_entries e, test_words w
where
  e.word_id = w.id
  and word in (
    'says', 'jill', 'toby', 'possibility', 'fabulous', 'dropping', 'wave', 'sis', 'repeating'
  )
group by e.id
having count(1) = 9;
+--------+
| id     |
+--------+
| 183527 | 
+--------+
1 row in set (0.01 sec)

Somewhat slower. What does MySQL do?
Code:
+----+-------------+-------+-------+--------------------+--------------------+---------+---------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys      | key                | key_len | ref     | rows | Extra                                        |
+----+-------------+-------+-------+--------------------+--------------------+---------+---------+------+----------------------------------------------+
|  1 | SIMPLE      | w     | range | PRIMARY,word       | word               | 767     | NULL    |    9 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | e     | ref   | test_entries_words | test_entries_words | 4       | vb.w.id |  130 |                                              | 
+----+-------------+-------+-------+--------------------+--------------------+---------+---------+------+----------------------------------------------+
It doesn't look like this could be optimized much further, except maybe by tweaking the indexes themselves. And you still have to fetch the actual words with a second query.


Disk space
But you wanted to save some disk space, right? Let's see how much the various tables and indexes occupy (output of "du -sk", so values are in kilobytes):

Single-table approach:
Code:
du -sck test_l*
12	test_links.frm
11364	test_links.MYD
17500	test_links.MYI
[B]28876	total[/B]

Inverted index:
Code:
du -sck test_{w,e}*
12	test_words.frm
196	test_words.MYD
320	test_words.MYI
12	test_entries.frm
11424	test_entries.MYD
22672	test_entries.MYI
[B]34636	total[/B]

Interesting, isn't it? You end up using roughly 6 MB more. Why is that? Well, just look at the join table: it is huge, bigger than storing the data directly as text.

For better performance, you have to add an index to the join table, which ends up with the following disk usage:
Code:
du -sck test_{w,e}*
12	test_words.frm
196	test_words.MYD
320	test_words.MYI
12	test_entries.frm
11424	test_entries.MYD
35656	test_entries.MYI
[B]47620	total[/B]
 

JerseyFoo

1/g = g-1
Reaction score
40
The words are intended as char(9) and smallInt unsigned. You may consider this too specific, but it's the difference between static and dynamic rows. And on the entries table it is going to matter.

If a site happens to have more than 2^16 words whose first 9 characters are unique, this won't work, but that can be worked around. I could also drop the number of stored characters - how many words that people actually use start with the same 6 letters?

Which also means the data may be cached in memory tables if resources are permitting.

Can't really go by tens of milliseconds, can we?

As I said before, make both attributes (columns) into the primary key.
How does MySQL handle results where the entire table is in the index?

And you didn't realize that this "little" detail essentially kills the whole performance because of the join overhead?
A system like vBulletin may use 20 different tables on a page request. It can't matter that much.

Essentially, you end up with nothing else than a search function. But even if I'm repeating myself, I'd like remind you that you're reinventing an inverted index inside a general purpose database. If you're down to such a system, why don't you switch to a search engine? It does exactly the same thing, only many times faster. Didn't you ask for an optimized version in the first place?
You are correct, but perhaps, if MySQL's query optimizer is terrible, we are overestimating its search function.

This data is also limited: whereas a search function is intended for finding all types of matches in oceans of data, we are using limited, words-only data, where the searched data doesn't much exceed the input data.

Thank you for the interest, Enou.

ALSO - on the words table, the word should be the primary key, and the ID left with no index. Unless you know something that's wrong with that?

A main idea is to exclude any "meta data" and have a separate table for control and administrative functions. I'll be setting up my own test; thank you for the queries.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
The words are intended as char(9) and smallInt unsigned. You may consider this too specific, but it's the difference between static and dynamic rows. And on the entries table it is going to matter.

If a site happens to have more than 2^16 words whose first 9 characters are unique, this won't work, but that can be worked around. I could also drop the number of stored characters - how many words that people actually use start with the same 6 letters?
Yet more "simplification", huh? Continue like that and there won't be anything left of whatever you were planning.


Which also means the data may be cached in memory tables if resources are permitting.

Can't really go by 10's of milliseconds can we.
Once you've got your memory upgrade, yes.

No, for a single query such response times don't matter, but with a lot of parallel ones on a busy website, a few 0.01 sec here and 0.04 sec there can decide between a premature hardware upgrade and a site running just fine on modest hardware.

How does MySQL handle results where the entire table is in the index?
Badly. In my case it was 4.78 sec vs. 0.01 sec for 200K entries, querying the second index column only. With 1M entries, the dedicated index took 0.04 sec; I didn't bother running the other one.

A system like vBulletin may use 20 different tables on a page request. It can't matter that much.
It's not the number of tables that matters, but how you use them. Trying to fetch all words of the matching entries with a single query basically kills the database; I aborted the query after 20 secs. As soon as you see "filesort" or "temporary table", you should start worrying.

You are correct, but perhaps if the query optimizer in MySQL is terrible maybe we are overestimating it's search function.
If I were to run such a system, I'd use a search engine, given my current knowledge of this planned feature. But given that it changes with every second post you make, it's pretty hard to estimate the requirements. If I were forced to use an RDBMS, I'd pick PostgreSQL (my data is more important than raw speed).

ALSO - on the words table, the word should be a primary key, and the ID with no index. Unless you know something that's wrong with that?
I hope you're kidding? You can't expect to quickly join two tables without the proper indexes. See my timings above; they should be more than enough to convince you.

All values of both columns ("id" and "word") have to be unique independently. You achieve this with a primary key on one column (= uniqueness + index) and a unique constraint with a dedicated index on the second one.

Which is exactly what I have used.
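With SQLite standing in for MySQL, that layout is simply (a sketch, not the exact test setup):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Primary key on "id" gives uniqueness plus an index on one column;
# the UNIQUE constraint on "word" gives a dedicated unique index on
# the other, so both columns are independently unique and indexed.
conn.execute('''
    CREATE TABLE test_words (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    )''')
conn.execute("INSERT INTO test_words VALUES (1, 'says')")

# A duplicate in either column is rejected:
for row in [(1, 'jill'), (2, 'says')]:
    try:
        conn.execute('INSERT INTO test_words VALUES (?, ?)', row)
        print('inserted', row)
    except sqlite3.IntegrityError:
        print('rejected', row)
```

Both test inserts fail: the first duplicates the id, the second duplicates the word, and either way a join on id or a lookup by word can use an index.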
 

JerseyFoo

1/g = g-1
Reaction score
40
I haven't changed it. I'm going by what I had originally illustrated.

first post said:
TABLE: words
w char(9) - the word, limited to 9 characters, primary index
k smallInt - the 'key', an integer used to reference the word
c tinyInt - count, how many resources this word points to
 