MySQL and Friendly URLs

JerseyFoo

1/g = g-1
Reaction score
40
I've been considering how best to optimize friendly URLs, i.e. something like "/mysql-and-friendly-urls/" instead of "/forums/showthread.php?p=949152". This system is an alternative to a simple query matching the full relative URL.

DB Structure
TABLE: words
w char(9) - the word, limited to 9 characters, primary index
k smallInt - the 'key', an integer used to reference the word
c tinyInt - count, how many resources this word points to

ex.
w | k | c
'this' | 115 | 112
'is' | 28 | 496
'a' | 31 | 337
'title' | 5049 | 8

TABLE: switch
i mediumInt - an id, primary index, no use other than being required by MySQL
k smallInt - a key from the words table, normal index
r mediumInt - resource ID

i | k | r
1029 | 115 | 6089
1030 | 115 | 4356
1031 | 115 | 87652

TABLE: rels
r mediumInt - resource ID
... - data used to reference to where and how the content is stored

---------

Handling
- URL is forwarded to PHP file
- Stripped of unsafe characters
- Exploded by "-"
- Words longer than 9 characters are truncated ( probably by MySQL via the char(9) column )
- Query applied*
- If multiple resources are found, they are returned in list form with an additional "+IDOFRESOURCE" at the end of the URI.
- Otherwise, data delivered
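The handling steps above can be sketched roughly like this (a hypothetical helper; `url_to_words` and the constant are made up, assuming the 9-character limit of the char(9) column):

```python
import re

MAX_WORD_LEN = 9  # matches the char(9) column in the words table

def url_to_words(uri):
    """Turn a friendly URL like '/mysql-and-friendly-urls/' into the
    list of (truncated) words to look up in the words table."""
    # Strip unsafe characters: keep only letters, digits and dashes
    safe = re.sub(r'[^a-z0-9-]', '', uri.strip('/').lower())
    # Explode by "-" and truncate each word to 9 characters,
    # mirroring what inserting into char(9) would do
    return [w[:MAX_WORD_LEN] for w in safe.split('-') if w]

print(url_to_words('/mysql-and-friendly-urls/'))  # → ['mysql', 'and', 'friendly', 'urls']
```

Truncating in PHP rather than letting MySQL do it keeps the IN () list consistent with what is actually stored.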


*The Query
Code:
select * 
from resources l
	where exists (
		select 1
		from switch s
		where 
			s.r = l.r
			and s.k in ('.implode(',',$keys).')
		)
The query needs work (I've added the missing s.r = l.r correlation, and as written it still matches any word rather than all of them). I'd like to expand it to grab the $keys as it goes, and have a limit based upon the lowest count found, but I'm not quite sure how to do this in MySQL effectively.

Any input would be appreciated.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
What's this supposed to do besides kill the database? :rolleyes:

This query is going to be run on every page request, so you'd better come up with something fast, not a fancy inverted-index-like scheme. Use a table where the first column contains the relevant part of the URI and the subsequent ones hold the information needed to map it to the real query string.
 

JerseyFoo

1/g = g-1
Reaction score
40
What's faster in your opinion...

Scan potentially 50,000 tinytexts for an exact match

- OR -

Convert 5 or so char(9) strings to IDs
Find a common match ( more complicated of course )

I've read that MySQL's main bottleneck is the HD. One option, with a large amount of data, could save 90% of the space and remain entirely static ( how large are our vocabularies anyway? ).
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
What's faster in your opinion...

Scan potentially 50,000 tinytexts for an exact match

- OR -

Convert 5 or so char(9) strings to IDs
Find a common match ( more complicated of course )
The first one, of course. But I'd rather use a varchar with a decent index. How can you imagine that finding the 5 parts in a large dataset would be faster than hashing and locating a single line of text? At that point you only have your 5 chunks and nothing else; you still have to merge them.


I've read that MySQL's main bottleneck is the HD. One option, with a large amount of data, could save 90% of the space and remain entirely static ( how large are our vocabularies anyway? ).
MySQL has an awful query optimizer and slows to a crawl once it has to rely on temporary tables, which are quite often used without any reason.
 

JerseyFoo

1/g = g-1
Reaction score
40
I'm still very curious and would like to measure the difference, and also this can still be used for a tag or search query. Would you have any suggestions on putting the query together?

Also, of these structures which would be better...

key | id | resource
1 | 146 | 129
2 | 146 | 628
3 | 146 | 300...

( The primary key is useless aside from MySQL needing it )

- OR -

id | resource
14600 | 129
14601 | 628
14602 | 300...

( A certain amount of space is reserved for each )
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
I'm still very curious and would like to measure the difference, and also this can still be used for a tag or search query. Would you have any suggestions on putting the query together?
I wouldn't waste my time on such a thing. But, as you're curious, I'd suggest you figure that out for yourself. At least, you'll learn something from it.

Also, of these structures which would be better...
How would I know? You're pasting different data models containing only random integers without any kind of explanation. :rolleyes:
 

JerseyFoo

1/g = g-1
Reaction score
40
Relating object IDs to other object IDs; it was related to the above. One key of a word may point to many different resources. But this requires the word key to be used multiple times, so it cannot be a primary index.
 

Samuraid

Advisor
Reaction score
81
So you are saying this might occur?

1234 | 5678
1234 | 5678

If so, you cannot make those attributes into a primary key because two of the rows are duplicates of each other.

But if this occurred (with no duplicates):

1234 | 5678
1234 | 5679

Then you absolutely could make it a primary key, and if those were the only 2 attributes in the table, then it would be recommended to do so in most cases.
 

JerseyFoo

1/g = g-1
Reaction score
40
1234 | 5678
1234 | 5678 ...will not occur ( pointless to relate twice )

1234 | 5678
1234 | 5780
1235 | 5678 ...may certainly occur though

I'm just relating one object ID to another. And as far as I know, the common way to do this is

1 | 1234 | 5678
2 | 1234 | 5780

That is, an excess primary key, plus a regular index on the word key. The third column, the resource ID, will never appear in a WHERE clause, so it need not be indexed.

I was asking if it's better ( unless you know another way ) to reserve space for each key.

123400 | 5678
123401 | 5780

To query [WHERE primary key >= key*100 LIMIT (objects counted)] instead of [WHERE non-unique index key = key]. This would only work if no object had more than 100 pointers, which can be worked around; I'm just curious as to how MySQL would handle it.
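The reserved-space scheme can be sketched with SQLite standing in for MySQL (table and column names are made up for illustration; SLOTS plays the role of the 100 IDs reserved per word key):

```python
import sqlite3

SLOTS = 100  # space reserved per word key, as described above

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE switch (id INTEGER PRIMARY KEY, resource INTEGER)')

# Word key 1234 owns the id range [123400, 123499]; store three resources.
for n, resource in enumerate([5678, 5780, 5999]):
    conn.execute('INSERT INTO switch VALUES (?, ?)', (1234 * SLOTS + n, resource))

# Range scan on the primary key instead of an equality match
# on a separate non-unique index
key = 1234
rows = conn.execute(
    'SELECT resource FROM switch WHERE id >= ? AND id < ?',
    (key * SLOTS, (key + 1) * SLOTS)
).fetchall()
print([r[0] for r in rows])  # → [5678, 5780, 5999]
```

The trade-off is the hard cap of SLOTS pointers per key, which the post acknowledges would need a workaround.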

But, as you're curious, I'd suggest you figure that out for yourself. At least, you'll learn something from it.
I can't pull knowledge out of my backside; I'd have to find it somewhere, and that's why I'm here. I don't intend to copy and paste an exact query that does what I want, just an example to show me the way. And I do know of ways to do this, however I'm looking for a way to do it all in one query.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
The system is deeply flawed and requires a far better data structure to work reliably. Imagine the following permalink:

the-cheaper-the-better

Since one word occurs twice, you not only have to allow duplicates in the data model, you must also implement ordering, otherwise you can't tell the entries "the-better-the-cheaper" and "the-cheaper-the-better" apart.

This sums up to having at least 3 fields:
1) a relation to the "word" table (i.e. foreign key)
2) a permalink identifier to group the different words together (i.e. where clause)
3) something defining the order inside a group (order clause)

And now imagine this data model competing against a simple indexed table.
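Those three fields could look like this - a sketch with made-up table and column names, using SQLite in place of MySQL - which is enough to tell the two permalinks apart:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
-- one row per word occurrence; "position" preserves the order
-- inside a permalink, so duplicate words are allowed
CREATE TABLE permalink_words (
    permalink_id INTEGER,   -- groups the words together (where clause)
    position     INTEGER,   -- order inside the group (order clause)
    word_id      INTEGER REFERENCES words(id),  -- foreign key to words
    PRIMARY KEY (permalink_id, position)
);
''')

for i, w in enumerate(['the', 'cheaper', 'better'], start=1):
    conn.execute('INSERT INTO words VALUES (?, ?)', (i, w))

# permalink 1: the-cheaper-the-better; permalink 2: the-better-the-cheaper
for pid, word_ids in [(1, [1, 2, 1, 3]), (2, [1, 3, 1, 2])]:
    for pos, wid in enumerate(word_ids):
        conn.execute('INSERT INTO permalink_words VALUES (?, ?, ?)',
                     (pid, pos, wid))

def permalink(pid):
    rows = conn.execute('''
        SELECT w.word FROM permalink_words p
        JOIN words w ON w.id = p.word_id
        WHERE p.permalink_id = ? ORDER BY p.position''', (pid,)).fetchall()
    return '-'.join(r[0] for r in rows)

print(permalink(1), permalink(2))
# → the-cheaper-the-better the-better-the-cheaper
```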
 

JerseyFoo

1/g = g-1
Reaction score
40
Yes, that is to be ignored.

"word1-word2" and "word2-word1" would point to the same resource. This could also be considered a benefit for certain resources, since words are commonly mixed up or misspelt.

If there are separate resources with the respective names, multiple resources would be brought up, which would require the form "word1-word2+aZqE", where what follows the + is a string ID. Or alternatively, the original (oldest) resource would be fetched.

Also, in support of this, thanks to its much smaller static data size, it may be cached easily in memory and should be more stable.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
Yes, that is to be ignored.
Interesting simplification. :rolleyes:

What about partial matches?

a-simple-implementation
a-simple-yet-inefficient-implementation
not-a-simple-implementation
a-simple-implementation-but-completely-broken

How are you going to cover those, if my friendly URL is "a-simple-implementation" ?


If there are separate resources with the respective names, multiple resources would be brought up, which would require the form "word1-word2+aZqE", where what follows the + is a string ID.
So, you're generating a unique ID? Why not restrict the "friendly" URL use only that ID?
 

JerseyFoo

1/g = g-1
Reaction score
40
Interesting simplification.
One of the best ways to optimize, isn't it?

What about partial matches?

a-simple-implementation
a-simple-yet-inefficient-implementation
not-a-simple-implementation
a-simple-implementation-but-completely-broken

How are you going to cover those, if my friendly URL is "a-simple-implementation" ?
All 4 would be returned in a list format, as if you were using a search. WordPress uses this method, for example. ( not that I have any respect for WP, that server killer )

It's really not that hard, as the system I'm implementing this in has its own templates, data setups, etc. ( yes, it is very fast ). A multiple-resource return would actually be faster, as it would not have to load the data for any resource, just the search template.

So, you're generating a unique ID? Why not restrict the "friendly" URL use only that ID?
The ID is to be used only where needed. When a new resource is created, or edited in which the URL is affected, it would check the table for any conflicts, and then generate a unique ID for both the new and the conflicting resource.

------
Really (believe it or not), I didn't make up this idea in an instant. I've been considering it for a month now and have read up a bit on MySQL optimization - http://dev.mysql.com/doc/refman/5.0/en/optimizing-database-structure.html - the only 'rule' found there that it breaks is using multiple tables.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
One of the best ways to optimize isn't it.
Yes, getting rid of all the annoying and complicated edge cases is always the best solution. :rolleyes:

I know this first-hand from various enterprise products, which tend to be limited to simple use cases because insufficient analysis and bad implementation render the full feature set impossible to use without making the whole application unresponsive.

All 4 would be returned, in a list format as if you were searching, as if you were using a search. WordPress uses this method for example. ( not that I have any respect for WP, that server killer )
Essentially, you end up with nothing other than a search function. But even if I'm repeating myself, I'd like to remind you that you're reinventing an inverted index inside a general-purpose database. If you're down to such a system, why don't you switch to a search engine? It does exactly the same thing, only many times faster. Didn't you ask for an optimized version in the first place?


The ID is to be used only where needed. When a new resource is created, or edited in which the URL is affected, it would check the table for any conflicts, and then generate a unique ID for both the new and the conflicting resource.
This sounds pretty complex to me. Why not always add that unique ID and ignore the friendly stuff altogether? That's exactly what various vBulletin plugins do, and major websites like the German Heise:

Whether you request:
[noparse]http://www.heise.de/newsticker/Microsoft-droht-neues-EU-Bussgeld-Bruessel-eroeffnet-Verfahren-Update--/meldung/121865[/noparse]

or:
[noparse]http://www.heise.de/newsticker/meldung/121865[/noparse]

you end up on the same page.

But of course, that'd be by far too simple and efficient to fit your needs.

EDIT:
the only 'rule' it breaks found there is using multiple tables.
And you didn't realize that this "little" detail essentially kills the whole performance because of the join overhead?
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
Let's do a little experiment, I'm creating the following schema:

1) Store the "friendly stuff" as text inside a single table
Code:
mysql> desc test_links;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(32)      | NO   | PRI | 0       |       |
| text  | varchar(255) | YES  | UNI | NULL    |       |
+-------+--------------+------+-----+---------+-------+

mysql> select count(1) from test_links;
+----------+
| count(1) |
+----------+
|   200000 |
+----------+


2) Store the same data as pseudo-inverted-index inside two tables:

First, the words:
Code:
mysql> desc test_words;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(32)      | NO   | PRI | NULL    |       |
| word  | varchar(255) | NO   | UNI | NULL    |       |
+-------+--------------+------+-----+---------+-------+

mysql> select count(1) from test_words;
+----------+
| count(1) |
+----------+
|     9993 |
+----------+

Then the join table:
Code:
mysql> desc test_entries;
+---------+---------+------+-----+---------+-------+
| Field   | Type    | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| id      | int(32) | NO   | PRI | NULL    |       |
| word_id | int(32) | NO   | PRI | NULL    |       |
+---------+---------+------+-----+---------+-------+

mysql> select count(1) from test_entries;
+----------+
| count(1) |
+----------+
|  1299652 |
+----------+

mysql> select count(distinct id) from test_entries;
+--------------------+
| count(distinct id) |
+--------------------+
|             200000 |
+--------------------+


If you're interested in the sample set, it's taken from Wiktionary; I merged together the 10,000 most frequent English words and generated random entries. For statistics' sake, we've got a pretty decent average word count per entry:
Code:
mysql> select avg(c) from ( select count(1) as c from test_entries group by id)  s;
+--------+
| avg(c) |
+--------+
| 6.4983 |
+--------+


Now, let's run some tests:
First, using the single table:
Code:
mysql> select * from test_links where text = 'says-jill-toby-possibility-fabulous-dropping-wave-sis-repeating';
+--------+-----------------------------------------------------------------+
| id     | text                                                            |
+--------+-----------------------------------------------------------------+
| 183527 | says-jill-toby-possibility-fabulous-dropping-wave-sis-repeating |
+--------+-----------------------------------------------------------------+
1 row in set (0.00 sec)

Then, using the inverted index:
Code:
mysql> select e.id
from test_entries e, test_words w
where
  e.word_id = w.id
  and word in (
    'says', 'jill', 'toby', 'possibility', 'fabulous', 'dropping', 'wave', 'sis', 'repeating'
  )
group by e.id
having count(1) = 9;
+--------+
| id     |
+--------+
| 183527 | 
+--------+
1 row in set (0.01 sec)

Somewhat slower. What does MySQL do?
Code:
+----+-------------+-------+-------+--------------------+--------------------+---------+---------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys      | key                | key_len | ref     | rows | Extra                                        |
+----+-------------+-------+-------+--------------------+--------------------+---------+---------+------+----------------------------------------------+
|  1 | SIMPLE      | w     | range | PRIMARY,word       | word               | 767     | NULL    |    9 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | e     | ref   | test_entries_words | test_entries_words | 4       | vb.w.id |  130 |                                              | 
+----+-------------+-------+-------+--------------------+--------------------+---------+---------+------+----------------------------------------------+
It doesn't look like this could be optimized much further, except maybe by tweaking the indexes themselves. And you still have to fetch the actual words with a second query.


Disk space
But you wanted to save some disk space, right? Let's see how much the various tables and indexes occupy (output of "du -sk", so values are in kilobytes):

Single-table approach:
Code:
du -sck test_l*
12	test_links.frm
11364	test_links.MYD
17500	test_links.MYI
[B]28876	total[/B]

Inverted index:
Code:
du -sck test_{w,e}*
12	test_words.frm
196	test_words.MYD
320	test_words.MYI
12	test_entries.frm
11424	test_entries.MYD
22672	test_entries.MYI
[B]34636	total[/B]

Interesting, isn't it? You end up using roughly 6 MB more. Why is that? Well, just look at the join table: it is huge, bigger than storing the data directly as text.

For better performance, you have to add an index to the join table, which ends up with the following disk usage:
Code:
du -sck test_{w,e}*
12	test_words.frm
196	test_words.MYD
320	test_words.MYI
12	test_entries.frm
11424	test_entries.MYD
35656	test_entries.MYI
[B]47620	total[/B]
 

JerseyFoo

1/g = g-1
Reaction score
40
The words are intended as char(9) and smallInt unsigned. You may consider this too specific, but it's the difference between static and dynamic rows. And on the entries table it is going to matter.

If a site happens to have more than 2^16 words whose first 9 characters are unique, this won't work, but that can be worked around. I could also drop the number of stored characters - how many words that people actually use start with the same 6 letters?

Which also means the data may be cached in memory tables if resources are permitting.

Can't really go by tens of milliseconds, can we?

As I said before, make both attributes (columns) into the primary key.
How does MySQL handle results where the entire table is in the index?

And you didn't realize that this "little" detail essentially kills the whole performance because of the join overhead?
A system like vBulletin may use 20 different tables on a page request. It can't matter that much.

Essentially, you end up with nothing else than a search function. But even if I'm repeating myself, I'd like remind you that you're reinventing an inverted index inside a general purpose database. If you're down to such a system, why don't you switch to a search engine? It does exactly the same thing, only many times faster. Didn't you ask for an optimized version in the first place?
You are correct, but perhaps, if MySQL's query optimizer is terrible, we are overestimating its search function.

This data is also limited: whereas a search function is intended for finding all types of matches in oceans of data, we are using limited, words-only data, where the searched data doesn't much exceed the input data.

Thank you for the interest, Enou.

ALSO - on the words table, the word should be the primary key, and the ID left with no index. Unless you know something that's wrong with that?

A main idea is to exclude any "meta data" and have a separate table for control and administrative functions. I'll be setting up my own test; thank you for the queries.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
The words are intended as char(9) and smallInt unsigned. You may consider this too specific, but it's the difference between static and dynamic rows. And on the entries table it is going to matter.

If a site happens to have more than 2^16 words whose first 9 characters are unique, this won't work, but that can be worked around. I could also drop the number of stored characters - how many words that people actually use start with the same 6 letters?
Yet more "simplification", huh? Continue like that and there won't be anything left of whatever you were planning.


Which also means the data may be cached in memory tables if resources are permitting.

Can't really go by 10's of milliseconds can we.
Once you've got your memory upgrade, yes.

No, for a single query such response times don't matter, but with a lot of parallel ones on a busy website, a few 0.01 sec here and 0.04 sec there can decide between a premature hardware upgrade and a site running just fine on modest hardware.

How does MySQL handle results where the entire table is in the index?
Badly. In my case it was 4.78 sec vs. 0.01 sec for 200K entries, querying the second index column only. With 1M entries, the dedicated index took 0.04 sec; I didn't bother running the other one.

A system like vBulletin may use 20 different tables on a page request. It can't matter that much.
It's not the number of tables that matters, but how you use them. Trying to fetch all words of the matching entries with a single query basically kills the database; I aborted the query after 20 secs. As soon as you see "filesort" or "temporary table", you should start worrying.

You are correct, but perhaps if the query optimizer in MySQL is terrible maybe we are overestimating it's search function.
If I were to run such a system, I'd use a search engine, given my current knowledge of this planned feature. But given that it changes with every second post you make, it's pretty hard to estimate the requirements. If I were forced to use an RDBMS, I'd pick PostgreSQL (my data is more important than raw speed).

ALSO - on the words table, the word should be a primary key, and the ID with no index. Unless you know something that's wrong with that?
I hope you're kidding? You can't expect to quickly join two tables without the proper indexes. See my timings above; they should be more than enough to convince you.

All values of both columns ("id" and "word") have to be unique independently. You achieve this with a primary key on one column (= uniqueness + index) and a unique constraint with a dedicated index on the second one.

Which is exactly what I have used.
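With SQLite standing in for MySQL, that layout is simply (a sketch, not the exact test setup):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Primary key on "id" gives uniqueness plus an index on one column;
# the UNIQUE constraint on "word" gives a dedicated unique index on
# the other, so both columns are independently unique and indexed.
conn.execute('''
    CREATE TABLE test_words (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    )''')
conn.execute("INSERT INTO test_words VALUES (1, 'says')")

# A duplicate in either column is rejected:
for row in [(1, 'jill'), (2, 'says')]:
    try:
        conn.execute('INSERT INTO test_words VALUES (?, ?)', row)
        print('inserted', row)
    except sqlite3.IntegrityError:
        print('rejected', row)
```

Both test inserts fail: the first duplicates the id, the second duplicates the word, and either way a join on id or a lookup by word can use an index.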
 

JerseyFoo

1/g = g-1
Reaction score
40
I haven't changed it. I'm going by what I had originally illustrated.

first post said:
TABLE: words
w char(9) - the word, limited to 9 characters, primary index
k smallInt - the 'key', an integer used to reference the word
c tinyInt - count, how many resources this word points to
 