Any MySQL wizards here?

timter51 · May 17, 2017

I've got a big ol' MySQL database here, one table with over 1,000,000 rows in a single column, each row contains a single sentence. Is there a query I can run to bring back a list of the most common words in the entire table?

martin-s · May 17, 2017

Not that I can think of. You'd be better off pulling results into php and dealing with it there.

Or depending on the reasoning for wanting to do it, download a csv backup and copy pasting it all into a word cloud site?

timter51 · May 17, 2017

Hmmm OK... we're talking manyt millions of individual words here so I think PHP might buckle!

Lovekraft · May 17, 2017

It's been a while since i used MySQL but would something like this work?

SELECT ColumnName, COUNT(ColumnName) AS ColumnName_Count
FROM table_name
GROUP BY ColumnName
ORDER BY ColumnName_Count DESC;

RobM · May 17, 2017

Loop through the rows - explode the sentence into words - increment an associative array of the words. At the end you'll have an array of all the words and their count.

timter51 · May 17, 2017

Lovekraft said:
It's been a while since i used MySQL but would something like this work?

SELECT ColumnName, COUNT(ColumnName) AS ColumnName_Count
FROM table_name
GROUP BY ColumnName
ORDER BY ColumnName_Count DESC;

I think that will only count exact record match? I don't think it'll help counting words in sentences.

RobM said:
Loop through the rows - explode the sentence into words - increment an associative array of the words. At the end you'll have an array of all the words and their count.

Thanks Rob. If anyone can knock that up for me for a reasonable fee please let me know!

RobM · May 17, 2017

Something like below: note you have to connect to your database and also this example only separates words by a space character.

<?
$query="SELECT column FROM table;";
$result=mysql_query($query);
while ($row=mysql_fetch_assoc($result))
{
$words=explode(" ",strtolower($row['column']));
for ($i=0;$i<count($words);$i++)
{
$array[$words[$i]]++;
}
}
print_r($array);
?>

timter51 · May 17, 2017

Lovely stuff Rob, sure I can get the database connection stuff done

RobM · May 17, 2017

If you don't care how the output is formatted I'd do a string replace to delete everything in the row that's not alphanumeric - that takes care of punctuation

timter51 · May 17, 2017

Getting there I think, but it's throwing up errors for every word like:

Notice: Undefined index: hotel in C:\wamp64\www\mysqchk\check2.php on line 26

Line 26 is:

$array[$words[$i]]++;

Anything obvious that I'm missing?

RobM · May 17, 2017

Change that line to:
if (isset($array[$words[$i]])) $array[$words[$i]]++; else $array[$words[$i]]=1;

I think it's only a warning anyway - the code would still function but this is safe

RobM · May 17, 2017

Also if you put arsort($array); before your print it will sort the array into descending order of amounts - ie most common word first

timter51 · May 17, 2017

So close I can smell it, I'm getting a memory error:

Allowed memory size of 134217728 bytes exhausted

I'm guessing because my database is so bloody big, there must be something like 5million+ words in there. I'll raise the PHP memory limit and see what happens.

RobM · May 17, 2017

Yeah increase the limit in your php.ini file or probably better to use a loop in your SELECT and a LIMIT to get a smaller amount - eg groups of 50k if the memory error is coming from selecting all records. However this will make no difference if the memory error is from the array.

martin-s · May 17, 2017

You could also ignore a list of stop words. Eg:
http://www.ranks.nl/stopwords

mrt · May 17, 2017

It's feasible to do this within the database itself rather than pulling a million rows into PHP - and dealing with the memory usage that entails!

I found this which almost does what you want: https://dba.stackexchange.com/a/82456 - I modified it slightly to add counts and order by most frequent words, leaving us with this query:

Code:

SELECT
  COUNT(*) as num,
  SUBSTRING_INDEX(SUBSTRING_INDEX(logs.useragent, ' ', numbers.n), ' ', -1) logs
FROM
  (SELECT 1 n UNION ALL SELECT 2
  UNION ALL SELECT 3 UNION ALL SELECT 4) numbers INNER JOIN logs
  ON CHAR_LENGTH(logs.useragent)-CHAR_LENGTH(REPLACE(logs.useragent, ' ', '')) >= numbers.n-1
GROUP BY logs
ORDER BY num DESC
LIMIT 20;

I tested it on a table I have with ~65,000 rows, the table name is 'logs' and the column I was searching is 'useragent' - replace these with the appropriate ones for your table. If you need to replace punctuation I'm sure that can be done too. The output for me looked something like this:

Code:

+-------+-----------------------------------------+
| num   | logs                                    |
+-------+-----------------------------------------+
| 38314 | Mozilla/5.0                             |
| 17251 | (compatible;                            |
| 12233 | NT                                      |
| 12211 | (Windows                                |
|  7452 | MSIE                                    |
|  6167 | 6.1;                                    |
|  5831 | U;                                      |
|  5718 | Mozilla/4.0                             |
|  5379 | Linux                                   |
|  5367 | (X11;                                   |
|  4365 | Windows                                 |
|  4355 | (Windows;                               |
|  4014 | Googlebot/2.1;                          |
|  4007 | +http://www.google.com/bot.html)        |
|  3822 | x86_64)                                 |
|  3581 | Stratagems                              |
|  3581 | Kumo                                    |
|  3205 |                                         |
|  2821 | http://www.nominet.org.uk/privacypolicy |
|  2803 | (Macintosh;                             |
+-------+-----------------------------------------+

That took just over 2 seconds to run on the ~65,000 rows which take up about 10MB of space.

It'll be interesting to see how this scales to a 1M row table!

timter51 · May 18, 2017

I fear I may have to admit defeat here... PHP still laughed at me when I raised to 1GB memory, and phpmyadmin/MySQL crapped out after 10 minutes running the query above (it just returned to a blank query input screen).

I think the only way I can make this happen is to:

1. Have a seperate counting table (one column for words, one for the count) to input new words and +1 the count for words already there
2. Have a new column in my data table where I can flag that row has already been parsed before. This way I can keep running the script, and it'll skip rows that have already been counted and entered into the counting table

Makes sense in my head, but it's beyond my PHP skills.

RobM · May 18, 2017

Did you put the select in a loop? If that is the cause of your memory problem it will fix it. But if the array is using up your memory you can always have another table with each word in it and a count for it - no array memory. If you want to PM me any relevant info I'll sort it for you.