MySQL, my nemesis

Oh boy, do I have a doozy of a rod for my back.

I’m working on a node app that scrapes a backup of a phpbb forum website, then rebuilds that site in a clean and static form. I’ve got the scraping going just fine, and it’s spitting into a database. Now comes the time to rebuild.

I was using SQLite, which worked but was (unsurprisingly) slow as heck, what with the ~300,000 posts this site had. So I’m migrating it to MySQL. And here’s where I’ve run into something of a roadblock.

My tables look like this:

Forums

name datatype description
id integer primary key
name string Forum name
description string Basic description
parentId integer ID of parent forum or NULL

Topics

name datatype description
id integer primary key
name string Topic name
dateCreated datetime When the topic was created
userId integer ID of the user that created this topic
forumId integer ID of parent forum

Posts

name datatype description
id integer primary key
subject string Post subject
datePosted datetime When the post was posted
body text body of the post
userId integer ID of the user that created this post
topicId integer ID of parent topic

Users

name datatype description
id integer primary key
name string User’s displayed name
avatar string URL of user’s avatar image

Seems simple, right? So now comes to the crunch. When building the forum index, I need some values. For instance:

And to make this, I used this hideous SQL query:

WITH
    TopicCounts as (
        SELECT COUNT(TPC.id) AS TopicCountTopics, SUM(TPC.PostCount) AS PostCount, TPC.ForumId
        FROM (
            SELECT Topics.id, Topics.ForumId, COUNT(Posts.id) AS PostCount
            FROM Topics
            LEFT JOIN Posts ON Posts.TopicId = Topics.id
            GROUP BY Topics.id
            ORDER BY Topics.ForumId
        ) AS TPC
        GROUP BY TPC.ForumId
    ),
    LatestTopics AS ( 
        SELECT Topics.name AS LatestPostName, 
            Topics.id AS LatestTopicId, 
            Users.name AS LatestUserName, 
            Posts.datePosted AS LatestPostDate, 
            Topics.ForumId
        FROM Topics
        INNER JOIN Posts ON Posts.TopicId = Topics.id
        AND Posts.id IN (
            SELECT MAX(PP.id)
            FROM Posts AS PP
            GROUP BY PP.TopicId
        )
        INNER JOIN Users ON Users.id = Posts.UserId
        ORDER BY Posts.datePosted DESC
    )
SELECT Forums.id, 
    Forums.name, 
    Forums.description, 
    Forums.parentId, 
    TopicCounts.TopicCountTopics AS TopicCount, 
    TopicCounts.PostCount AS PostCount, 
    LatestTopics.LatestPostName, 
    LatestTopics.LatestPostDate, 
    LatestTopics.LatestUserName,
    MAX(LatestTopics.LatestTopicId)
FROM Forums
LEFT JOIN TopicCounts ON TopicCounts.ForumId = Forums.id
LEFT JOIN LatestTopics ON LatestTopics.ForumId = Forums.id
GROUP BY Forums.id
ORDER BY Forums.parentId, Forums.id

Which worked fantastically in SQLite, especially with the orderings and returning the top result. MySQL, however, is not behaving, complaining that expression of select list is not in group by clause and contains nonaggregated column. And “fixing” that by telling it that’s okay isn’t helping, because the aggregated results are being returned in arbitrary order.

For consistency’s sake, I can say that IDs are sequential, so MAX(Posts.id) will always return the latest post. I also understand this is a non-trivial thing for SQL, but I’m hoping that with modern systems like MySQL 8 here, it can be made to work.

Execution time is not particularly important, as this is being run to create a static site, so it’s a one-time query, but doing all the aggregation by hand, especially with a table involving 300 thousand rows, is going to be a little daunting.

Any coders care to help? Feel free to @ me on Mastodon or find me in the LRR discord.

An update!

With some assistance from SnyperWolf, I have the following:

WITH
    TopicCounts as (
        SELECT COUNT(TPC.id) AS TopicCountTopics, SUM(TPC.PostCount) AS PostCount, TPC.ForumId
        FROM (
            SELECT Topics.id, Topics.ForumId, COUNT(Posts.id) AS PostCount
            FROM Topics
            LEFT JOIN Posts ON Posts.TopicId = Topics.id
            GROUP BY Topics.id
            ORDER BY Topics.ForumId
        ) AS TPC
        GROUP BY TPC.ForumId
    ),
    LatestTopics AS ( 
        SELECT TM.name AS LatestPostName, 
            TM.id AS LatestTopicId, 
            Users.name AS LatestUserName, 
            TM.datePosted AS LatestPostDate, 
            TM.ForumId
        FROM 
            (
                SELECT TT.name, TT.id, TT.ForumId, PM.datePosted, PM.UserId, ROW_NUMBER() OVER (PARTITION BY TT.ForumId ORDER BY PM.datePosted DESC) AS rn
                FROM Topics AS TT
                INNER JOIN Posts PM ON PM.TopicId = TT.id
                AND PM.id IN (
                    SELECT MAX(PP.id)
                    FROM Posts AS PP
                    GROUP BY PP.TopicId
                )
            ) AS TM
        INNER JOIN Users ON Users.id = TM.UserId
        WHERE rn=1
        ORDER BY TM.datePosted DESC
    )
SELECT Forums.id, 
    Forums.name, 
    Forums.description, 
    Forums.parentId, 
    TopicCounts.TopicCountTopics AS TopicCount, 
    TopicCounts.PostCount AS PostCount, 
    LatestTopics.LatestPostName, 
    LatestTopics.LatestPostDate, 
    LatestTopics.LatestUserName,
    LatestTopics.LatestTopicId
FROM Forums
LEFT JOIN TopicCounts ON TopicCounts.ForumId = Forums.id
LEFT JOIN LatestTopics ON LatestTopics.ForumId = Forums.id
ORDER BY Forums.parentId, Forums.id

Which seems to work perfectly with MySQL. And I’m not going to think about getting that same query to work with SQLite.

© 2025 Robert "Anaerin" Johnston
Powered by Hexo