MidJourney V4 review

The MidJourney team has constantly been improving their model to further the development of text-to-image AI systems. I started back when they were on V2 of their algorithm and previously wrote a review comparing V2 to V3 after V3 was launched. V4 brings new characteristics to MJ and has changed the game a bit.

The team teased us with V4 for awhile by talking about the development and also giving us the ability to rank images before the model was released. What we saw was a big change in the realism in the renders and accuracy. Keeping in step with how they’ve released the models, it just happened suddenly with very little warning. Fortunately I was able to interact with V4 a lot the first day and into the weekend.

One note: for this review I will *not* be comparing V4 to TEST/TESTP. TEST/TESTP are based on Stable Diffusion and brings with them all the issues SD has. While the models were good, I still felt there were limited and not fully expressive like MJ has the capability to be. For this review, I will be comparing V4 to both V2 and V3.

First off, V4 is incredibly impressive even though it is still considered being in Alpha. The renders are very realistic and show big improvement in what they can deliver. The team is finally closing the gap with Dall-E in terms of photorealism and prompt accuracy but still brings over the MJ aesthetic we have grown to love. Another benefit to V4 is that it produces the traditional 1024×1024 images in a 2×2 grid whereas TEST/TESTP only provided a 1×1 grid and used up double the GPU hours. It’s nice to have that back as I’ve started to be more frugal in my hour usage.

To start the comparison, I will be using the same prompt with a consistent seed and the default options for all the algorithms. Below is the V3 2×2 grid render with prompt details:

/imagine a vintage 1900s photograph of a grotesque monster –seed 0420 –v 3

We’ll use the above grid as the baseline as we compare with V4. Using the same prompt and seed, below is the V4 grid output:

/imagine a vintage 1900s photograph of a grotesque monster –seed 0420 –v 4

For one more baseline, below is the V2 output of the same prompt:

/imagine a vintage 1900s photograph of a grotesque monster –seed 0420 –v 2

Immediately the differences stand out. For one, the MJ team has mentioned that V4 is a all new and doesn’t borrow from V2 or V3. Given how vastly different the initial grids are, I agree with the team that this is all new. Also, the V2 grid is somewhat similar to the V3 grid which further supports this. While both have rendered monsters, the V4 ones are in a portrait style where the V3 ones are a mix of portrait and full body. V4 also looks more “clean” than V3/V2. While I brought this up in my previous review of V3 removing some of the dirt that V2 had, V4 seems like even a future departure from this. Since we’re still in alpha with V4, we’ll have to wait and see if there will be options to adjust the output like we were able to in V3.

When it comes to the upscaling, the alpha qualities of V4 are more apparent. The upscale renders don’t seem as deep and the image quality appears far to bright. Running the renders through the beta upscale helps clean this up a bit, but hopefully the quality gets further refined as they make improvements.

V3 U2 upscale
V4 U2 Upscale
V4 U2 Beta Upscale

V4 is a huge step forward for MidJourney as they become more realistic, accurate, and improve their model.

Cultures surrounding the AIs

I’ve been very fortunate to be involved early on with 3 of the latest AI text-to-image art systems. First was MidJourney, second was Dall-e 2, and now I’ve been able to get in on early access for Stable Diffusion. From a tech perspective, each of these generate different styles of images and have their various strengths and weaknesses. What is more curious is to observe the cultures each of these tools has surrounded themselves with.

MidJourney

When I got on MidJourney, there was a sense of exploration amongst ourselves. We seemed to all be getting in on something new and unique and we were all seemingly working together to explore this new tool. MidJourney attracted enthusiasts who wanted to learn and explore this new tool, together. There was a lot of collaboration and openness as we experimented with different prompts and getting what we wanted out of the system.

Dall-E 2

When I first got access to Dall-e and started to get into the surrounding unofficial communities, there was a striking different tone compared to the folks who were using MJ. From the restricted access, there was a lot more scams showing up where people were taking advantage of this lack of access. People were charging money to run prompts, they were charging money for invites(that didn’t exist), and there was a much bigger sense of trying to use Dall-e for commercial purposes.

Stable Diffusion

SD is the epitome of tech bro culture. Everything surrounding their release was all hype. Folks who had requested beta access were granted the ability to get in on their Discord server, but we still had to wait over 24 hours before the bot even came online. During this time, the mods, founder, and server staff continually teased us with SD’s generations and kept on building the hype. It seemed like they were focused more on growing a user fan base first instead of making sure the product was refined before launch. The founder’s messaging about bringing in “influencers” and making statements about how “well funded” they are further exemplified this opinion.

MidJourney Version3 Comparison and Review

For the past two days I have been exploring and testing the new Version3 generational algorithm from MidJourney. This one’s goal is to improve accuracy of output and a new upscaler to remove distortions and artifacts. My initial experiments showed the potential of this new algorithm but it seemed to take away some of the “dirtiness” I came to like when beginning to work with MidJourney.

I decided to perform some experiments comparing the Version2 and Version3 algorithms and see how the Version3 can be tweaked to regain some of the “dirt” the V2 algorithm had that made it so special.

To start, I generated the inside of an abandoned shopping mall taken on a Polaroid. MidJourney V2 did a really good job of capturing that grit and small distortions that capture the aesthetic Polaroid film would have.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 2

This image will serve as the reference point I will try and recreate with the V3 algorithm. By setting a seed value, the various generations will be somewhat similar.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 3

To start, I created a V3 generation using all the default settings. This is the most “neutral” of the V3 options in stylizing and quality. Comparing the two, it appears the V3 one is more “clean” in appearance as some in some cases the photo doesn’t look like a Polaroid at all.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 3 –stylize 5000

Bumping up the stylizing gives it a way cleaner look. There’s hardly any distortions in the sample renders and the subject of the images themselves appear too straight and perfect. This has started to drift away from an abandoned mall.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 3 –stylize 20000

Pushing the stylizing further up generates more abstract renderings. As shown above, these are completely far from want a mall looks like. There’s also mountains, clouds, and other structures in the resulting images that don’t make any sense. When testing these high stylizing options on other generations, those features mentioned always show up. Regardless of what the subject is, multi-color clouds, mountains, and other fantasy objects appear.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 3 –stylize 625 –q .25

Next I decided to explore the quality options while keeping the stylized settings as low as possible. The lowest quality, .25, produced renderings closer to what the V2 algorithm did. The images just don’t seem that appealing as the quality is really low.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 3 –stylize 625 –q .5

Pushing the quality up a notch while keeping the stylizing the same got me much closer to V2 renderings. The proper aesthetic for an abandoned mall is there and the images do look like they were taken with a Polaroid. I’m happy with these, but there’s still some exploring to do with other options.

/imagine a polaroid of the inside of an abandoned mall –seed 0420 –v 3 –stylize 1250 –q .5

For the final rendering, I put the stylizing back to the default and still kept the quality down a bit. This produced images that are the closest I was able to come to the V2 aesthetic of a Polaroid of an abandoned mall. Not all the grit from V2 is there, but enough of the characteristic remain for it to be usable.

The Differences Between Dall-E and MidJourney

I was very fortunate to get access to two very powerful text-to-image AIs within a short period of time. I had been on MidJourney for a few weeks before my email to join OpenAI’s Dall-E 2 platform arrived. I honestly forgot I even applied to Dall-E so it was a very pleasant surprise.

Dall-E 2 is a much more powerful AI. It produces incredibly detailed renderings that could easily be confused with real life images. The accuracy is quite stunning and frightening at the same time.

Unicorn Loafers generated by Dall-E 2

While experimenting with different prompts, I decided to try the same prompt in both systems to see what images they would produce. From these experiments, I got the impression that MidJourney produces more abstract renderings whereas Dall-E is more literal. When using a single word prompt, like “nothing”, MidJourney produced images that invoked the emotion or demonstrated what the meaning of the word was; whereas Dall-E produced images that were literal examples of the word.

As you can see above, the differences are distinct. I know the AI’s were trained on different models, but with working with them it is important to know where their strengths lie.

If I want to create an incredibly realistic looking literal thing I will use Dall-E. If I want more abstract, concept art style rendering, I will use MidJourney to complete that task. Not to say either cannot produces images like the other, but I’ve found them to just be stronger at one thing and suit very different purposes.

My Journey with AI

The start of my journey into AI art starts with getting into digital glitch art. Back round April, I decided to take a leap into the realm of digital glitch art. I had been doing digital photography for a few years taking pictures at the local open Jam and being hired by a few artists to do promotional work. Glitch art really caught my attention and I dove as deep as I could into the community with ending up on Rob Sheridan’s Discord server.

He was making several posts about the Volstof Institute and talked about how it was made with an AI called MidJourney. I quickly signed up for the beta and within a few weeks I received the invite to the Discord server and I was on my way to exploring the possibilities that await. I received access in late May

It was like drinking from a fire hose. There was so much happening on the Discord with people collaborating and all of us figuring out how to use it. My first images were kinda meh, but I started to get an understanding of how best to use MidJourney and get the results I want.

Early attempt at generating

After getting to know MidJourney better and turn out the renderings I was looking for, the usefulness and just how powerful MidJourney was came to light. Probably one of the strongest use cases for these text-to-image AI’s is to rapidly prototype and produce concept art. The renderings from MidJourney are rarely perfect, but they are able to capture the style and concept trying to be expressed. This technology is incredibly powerful and will change this industry fundamentally.