
Google Gemini: Everything you need to know about the new generative AI platform | TechCrunch


Google’s trying to make waves with Gemini, its flagship suite of generative AI models, apps and services.

So what is Gemini? How can you use it? And how does it stack up to the competition?

To make it easier to keep up with the latest Gemini developments, we’ve put together this handy guide, which we’ll keep updated as new Gemini models, features and news about Google’s plans for Gemini are released.

What is Gemini?

Gemini is Google’s long-promised, next-gen GenAI model family, developed by Google’s AI research labs DeepMind and Google Research. It comes in three flavors:

  • Gemini Ultra, the most performant Gemini model.
  • Gemini Pro, a “lite” Gemini model.
  • Gemini Nano, a smaller “distilled” model that runs on mobile devices like the Pixel 8 Pro.

All Gemini models were trained to be “natively multimodal” — in other words, able to work with and use more than just words. They were pretrained and fine-tuned on a variety of audio, images and videos, a large set of codebases and text in different languages.

This sets Gemini apart from models such as Google’s own LaMDA, which was trained exclusively on text data. LaMDA can’t understand or generate anything other than text (e.g., essays, email drafts), but that isn’t the case with Gemini models.

What’s the difference between the Gemini apps and Gemini models?

Image Credits: Google

Google, proving once again that it lacks a knack for branding, didn’t make it clear from the outset that Gemini is separate and distinct from the Gemini apps on the web and mobile (formerly Bard). The Gemini apps are simply an interface through which certain Gemini models can be accessed — think of it as a client for Google’s GenAI.

Incidentally, the Gemini apps and models are also totally independent from Imagen 2, Google’s text-to-image model that’s available in some of the company’s dev tools and environments.

What can Gemini do?

Because the Gemini models are multimodal, they can in theory perform a range of multimodal tasks, from transcribing speech to captioning images and videos to generating artwork. Some of these capabilities haven't reached the product stage yet (more on that later), but Google's promising all of them — and more — at some point in the not-too-distant future.

Of course, it’s a bit hard to take the company at its word.

Google seriously underdelivered with the original Bard launch. And more recently it ruffled feathers with a video purporting to show Gemini’s capabilities that turned out to have been heavily doctored and was more or less aspirational.

Still, assuming Google is being more or less truthful with its claims, here’s what the different tiers of Gemini will be able to do once they reach their full potential:

Gemini Ultra

Google says that Gemini Ultra — thanks to its multimodality — can be used to help with things like physics homework, solving problems step-by-step on a worksheet and pointing out possible mistakes in already filled-in answers.

Gemini Ultra can also be applied to tasks such as identifying scientific papers relevant to a particular problem, Google says — extracting information from those papers and “updating” a chart from one by generating the formulas necessary to re-create the chart with more recent data.

Gemini Ultra technically supports image generation, as alluded to earlier. But that capability hasn’t made its way into the productized version of the model yet — perhaps because the mechanism is more complex than how apps such as ChatGPT generate images. Rather than feed prompts to an image generator (like DALL-E 3, in ChatGPT’s case), Gemini outputs images “natively,” without an intermediary step.

Gemini Ultra is available as an API through Vertex AI, Google’s fully managed AI developer platform, and AI Studio, Google’s web-based tool for app and platform developers. It also powers the Gemini apps — but not for free. Access to Gemini Ultra through what Google calls Gemini Advanced requires subscribing to the Google One AI Premium Plan, priced at $20 per month.
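
For developers, getting a response from a Gemini model through AI Studio boils down to a short script. Here's a minimal sketch using Google's google-generativeai Python SDK; the API key placeholder and the model name are assumptions, and the exact model identifiers on offer vary by access tier and SDK version.

```python
# Minimal sketch of calling a Gemini model via the google-generativeai SDK.
# The API key placeholder and the model name are assumptions; available model
# identifiers depend on your access tier and SDK version.
import google.generativeai as genai

genai.configure(api_key="YOUR_AI_STUDIO_API_KEY")  # key issued through AI Studio

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Explain what 'natively multimodal' means in one paragraph.")
print(response.text)
```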

The AI Premium Plan also connects Gemini to your wider Google Workspace account — think emails in Gmail, documents in Docs, presentations in Slides and Google Meet recordings. That's useful for, say, summarizing emails or having Gemini capture notes during a video call.

Gemini Pro

Google says that Gemini Pro is an improvement over LaMDA in its reasoning, planning and understanding capabilities.

An independent study by Carnegie Mellon and BerriAI researchers found that the initial version of Gemini Pro was indeed better than OpenAI’s GPT-3.5 at handling longer and more complex reasoning chains. But the study also found that, like all large language models, this version of Gemini Pro particularly struggled with mathematics problems involving several digits, and users found examples of bad reasoning and obvious mistakes.

Google promised remedies, though — and the first arrived in the form of Gemini 1.5 Pro.

Designed to be a drop-in replacement, Gemini 1.5 Pro is improved in a number of areas compared with its predecessor, perhaps most significantly in the amount of data that it can process. Gemini 1.5 Pro can take in ~700,000 words, or ~30,000 lines of code — 35x the amount Gemini 1.0 Pro can handle. And — the model being multimodal — it’s not limited to text. Gemini 1.5 Pro can analyze up to 11 hours of audio or an hour of video in a variety of different languages, albeit slowly (e.g., searching for a scene in a one-hour video takes 30 seconds to a minute of processing).

Gemini 1.5 Pro entered public preview on Vertex AI in April.

An additional endpoint, Gemini Pro Vision, can process text and imagery — including photos and video — and output text along the lines of OpenAI’s GPT-4 with Vision model.

Using Gemini Pro in Vertex AI. Image Credits: Gemini

Within Vertex AI, developers can customize Gemini Pro to specific contexts and use cases using a fine-tuning or “grounding” process. Gemini Pro can also be connected to external, third-party APIs to perform particular actions.

In AI Studio, there are workflows for creating structured chat prompts using Gemini Pro. Developers have access to both the Gemini Pro and Gemini Pro Vision endpoints, and they can adjust the model temperature to control the output's creative range, provide examples to give tone and style instructions, and tune the safety settings.
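
As a rough illustration of those knobs, here's what setting the temperature and a safety threshold might look like with the same Python SDK. The safety category and threshold strings are assumptions; the accepted values differ across SDK versions.

```python
# Rough sketch of tuning temperature and safety settings for a Gemini Pro call.
# The safety category/threshold strings are assumptions; check your SDK version
# for the exact accepted values.
import google.generativeai as genai

genai.configure(api_key="YOUR_AI_STUDIO_API_KEY")

model = genai.GenerativeModel(
    "gemini-pro",
    generation_config=genai.types.GenerationConfig(
        temperature=0.2,        # lower temperature narrows the creative range
        max_output_tokens=256,
    ),
    safety_settings={"HARASSMENT": "BLOCK_MEDIUM_AND_ABOVE"},  # assumed shorthand
)

print(model.generate_content("Rewrite this note in a formal tone: see you at 3.").text)
```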

Gemini Nano

Gemini Nano is a much smaller version of the Gemini Pro and Ultra models, and it’s efficient enough to run directly on (some) phones instead of sending the task to a server somewhere. So far, it powers a couple of features on the Pixel 8 Pro, Pixel 8 and Samsung Galaxy S24, including Summarize in Recorder and Smart Reply in Gboard.

The Recorder app, which lets users push a button to record and transcribe audio, includes a Gemini-powered summary of your recorded conversations, interviews, presentations and other snippets. Users get these summaries even if they don’t have a signal or Wi-Fi connection available — and in a nod to privacy, no data leaves their phone in the process.

Gemini Nano is also in Gboard, Google’s keyboard app. There, it powers a feature called Smart Reply, which helps to suggest the next thing you’ll want to say when having a conversation in a messaging app. The feature initially only works with WhatsApp but will come to more apps over time, Google says.

And in the Google Messages app on supported devices, Nano enables Magic Compose, which can craft messages in styles like “excited,” “formal” and “lyrical.”

Is Gemini better than OpenAI’s GPT-4?

Google has several times touted Gemini’s superiority on benchmarks, claiming that Gemini Ultra exceeds current state-of-the-art results on “30 of the 32 widely used academic benchmarks used in large language model research and development.” The company says that Gemini 1.5 Pro, meanwhile, is more capable at tasks like summarizing content, brainstorming and writing than Gemini Ultra in some scenarios; presumably this will change with the release of the next Ultra model.

But leaving aside the question of whether benchmarks really indicate a better model, the scores Google points to appear to be only marginally better than OpenAI’s corresponding models. And — as mentioned earlier — some early impressions haven’t been great, with users and academics pointing out that the older version of Gemini Pro tends to get basic facts wrong, struggles with translations and gives poor coding suggestions.

How much does Gemini cost?

Gemini 1.5 Pro is free to use in the Gemini apps and, for now, AI Studio and Vertex AI.

Once Gemini 1.5 Pro exits preview in Vertex, however, input will cost $0.0025 per character while output will cost $0.00005 per character. Vertex customers pay per 1,000 characters (about 140 to 250 words) and, in the case of models like Gemini Pro Vision, per image ($0.0025).

Let’s assume a 500-word article contains 2,000 characters. Summarizing that article with Gemini 1.5 Pro would cost $5. Meanwhile, generating an article of a similar length would cost $0.10.
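
Spelled out as a few lines of Python, that back-of-the-envelope math looks like this (it ignores the cost of the summary's own output characters, as the example above does):

```python
# The article's back-of-the-envelope math, using the per-character rates quoted above.
input_rate = 0.0025     # dollars per input character
output_rate = 0.00005   # dollars per output character

article_chars = 2_000   # roughly a 500-word article

summarize_cost = article_chars * input_rate    # the model reads the article
generate_cost = article_chars * output_rate    # the model writes a similar-length article

print(f"Summarizing: ${summarize_cost:.2f}")   # $5.00
print(f"Generating:  ${generate_cost:.2f}")    # $0.10
```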

Ultra pricing has yet to be announced.

Where can you try Gemini?

Gemini Pro

The easiest place to experience Gemini Pro is in the Gemini apps. Pro and Ultra are answering queries in a range of languages.

Gemini Pro and Ultra are also accessible in preview in Vertex AI via an API. The API is free to use “within limits” for the time being and supports certain regions, including Europe, as well as features like chat functionality and filtering.

Elsewhere, Gemini Pro and Ultra can be found in AI Studio. Using the service, developers can iterate prompts and Gemini-based chatbots and then get API keys to use them in their apps — or export the code to a more fully featured IDE.

Code Assist (formerly Duet AI for Developers), Google’s suite of AI-powered assistance tools for code completion and generation, is using Gemini models. Developers can perform “large-scale” changes across codebases, for example updating cross-file dependencies and reviewing large chunks of code.

Google’s brought Gemini models to its dev tools for Chrome and Firebase mobile dev platform, and its database creation and management tools. And it’s launched new security products underpinned by Gemini, like Gemini in Threat Intelligence, a component of Google’s Mandiant cybersecurity platform that can analyze large portions of potentially malicious code and let users perform natural language searches for ongoing threats or indicators of compromise.

Gemini Nano

Gemini Nano is on the Pixel 8 Pro, Pixel 8 and Samsung Galaxy S24 — and will come to other devices in the future. Developers interested in incorporating the model into their Android apps can sign up for a sneak peek.

Is Gemini coming to the iPhone?

It might! Apple and Google are reportedly in talks to put Gemini to use for a number of features to be included in an upcoming iOS update later this year. Nothing’s definitive, as Apple is also reportedly in talks with OpenAI, and has been working on developing its own GenAI capabilities.

This post was originally published Feb. 16, 2024 and has since been updated to include new information about Gemini and Google’s plans for it.



NIST launches a new platform to assess generative AI | TechCrunch


The National Institute of Standards and Technology (NIST), the U.S. Commerce Department agency that develops and tests tech for the U.S. government, corporations and the broader public, today announced the launch of NIST GenAI, a new program spearheaded by NIST to assess generative AI technologies, including text- and image-generating AI.

A platform designed to evaluate various forms of generative AI tech, NIST GenAI will release benchmarks, help create “content authenticity” detection (i.e. deepfake-checking) systems and encourage the development of software to spot the source of fake or misleading information, explains NIST on its newly-launched NIST GenAI site and in a press release.

“The NIST GenAI program will issue a series of challenge problems designed to evaluate and measure the capabilities and limitations of generative AI technologies,” the press release reads. “These evaluations will be used to identify strategies to promote information integrity and guide the safe and responsible use of digital content.”

NIST GenAI’s first project is a pilot study to build systems that can reliably tell the difference between human-created and AI-generated media, starting with text. (While many services purport to detect deepfakes, studies — and our own testing — have shown them to be unreliable, particularly when it comes to text.) NIST GenAI is inviting teams from academia, industry and research labs to submit either “generators” — AI systems to generate content — or “discriminators” — systems that try to identify AI-generated content.

Generators in the study must produce summaries given a topic and a set of documents, while discriminators must detect whether a given summary is AI-written or not. To ensure fairness, NIST GenAI will provide the data necessary to train generators and discriminators; systems trained on publicly available data won’t be accepted, including but not limited to open models like Meta’s Llama 3.
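
NIST hasn't published submission interfaces, but the two roles are easy to picture. Here is a purely hypothetical sketch, with invented names and stub logic, of what a generator and a discriminator boil down to:

```python
# Hypothetical sketch of the two roles in the pilot. All names and logic here are
# invented for illustration; they are not NIST's actual submission format.
from dataclasses import dataclass

@dataclass
class Task:
    topic: str
    documents: list[str]

def generator(task: Task) -> str:
    """Produce a summary of the supplied documents for the given topic (stub)."""
    return f"A summary of {len(task.documents)} documents about {task.topic}."

def discriminator(summary: str) -> float:
    """Return an estimated probability that the summary is AI-generated (stub)."""
    return 0.5  # a real entry would score statistical and stylistic signals

task = Task(topic="content authenticity", documents=["report A ...", "report B ..."])
candidate = generator(task)
print(candidate, discriminator(candidate))
```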

Registration for the pilot will begin May 1, with the results scheduled to be published in February 2025.

NIST GenAI’s launch — and deepfake-focused study — comes as deepfakes grow exponentially.

According to data from Clarity, a deepfake detection firm, 900% more deepfakes have been created this year compared to the same time frame last year. It’s causing alarm, understandably. A recent poll from YouGov found that 85% of Americans said they were concerned about the spread of misleading deepfakes online.

The launch of NIST GenAI is a part of NIST’s response to President Joe Biden’s executive order on AI, which laid out rules requiring greater transparency from AI companies about how their models work and established a raft of new standards, including for labeling content generated by AI.

It’s also the first AI-related announcement from NIST after the appointment of Paul Christiano, a former OpenAI researcher, to the agency’s AI Safety Institute.

Christiano was a controversial choice for his “doomerist” views; he once predicted that “there’s a 50% chance AI development could end in [humanity’s destruction].” Critics — including scientists within NIST, reportedly — fear Christiano may encourage the AI Safety Institute to focus on “fantasy scenarios” rather than realistic, more immediate risks from AI.

NIST says that NIST GenAI will inform the AI Safety Institute’s work.



Copilot Workspace is GitHub's take on AI-powered software engineering | TechCrunch


Is the future of software development an AI-powered IDE? GitHub’s floating the idea.

At its annual GitHub Universe conference in San Francisco on Monday, GitHub announced Copilot Workspace, a dev environment that taps what GitHub describes as “Copilot-powered agents” to help developers brainstorm, plan, build, test and run code in natural language.

Jonathan Carter, head of GitHub Next, GitHub’s software R&D team, pitches Workspace as somewhat of an evolution of GitHub’s AI-powered coding assistant Copilot into a more general tool, building on recently introduced capabilities like Copilot Chat, which lets developers ask questions about code in natural language.

“Through research, we found that, for many tasks, the biggest point of friction for developers was in getting started, and in particular knowing how to approach a [coding] problem, knowing which files to edit and knowing how to consider multiple solutions and their trade-offs,” Carter said. “So we wanted to build an AI assistant that could meet developers at the inception of an idea or task, reduce the activation energy needed to begin and then collaborate with them on making the necessary edits across the entire codebase.”

At last count, Copilot had over 1.8 million paying individual and 50,000 enterprise customers. But Carter envisions a far larger base, drawn in by feature expansions with broad appeal, like Workspace.

“Since developers spend a lot of their time working on [coding issues], we believe we can help empower developers every day through a ‘thought partnership’ with AI,” Carter said. “You can think of Copilot Workspace as a companion experience and dev environment that complements existing tools and workflows and enables simplifying a class of developer tasks … We believe there’s a lot of value that can be delivered in an AI-native developer environment that isn’t constrained by existing workflows.”

There’s certainly internal pressure to make Copilot profitable.

Copilot loses an average of $20 a month per user, according to a Wall Street Journal report, with some customers costing GitHub as much as $80 a month. And the number of rival services continues to grow. There’s Amazon’s CodeWhisperer, which the company made free to individual developers late last year. There are also startups, like Magic, Tabnine, Codegen and Laredo.

Given a GitHub repo or a specific bug within a repo, Workspace — underpinned by OpenAI’s GPT-4 Turbo model — can build a plan to (attempt to) squash the bug or implement a new feature, drawing on an understanding of the repo’s comments, issue replies and larger codebase. Developers get suggested code for the bug fix or new feature, along with a list of the things they need to validate and test that code, plus controls to edit, save, refactor or undo it.

Image Credits: GitHub

The suggested code can be run directly in Workspace and shared among team members via an external link. Those team members, once in Workspace, can refine and tinker with the code as they see fit.

Perhaps the most obvious way to launch Workspace is from the new “Open in Workspace” button to the left of issues and pull requests in GitHub repos. Clicking on it opens a field to describe the software engineering task to be completed in natural language, like, “Add documentation for the changes in this pull request,” which, once submitted, gets added to a list of “sessions” within the new dedicated Workspace view.

Image Credits: GitHub

Workspace executes requests systematically step by step, creating a specification, generating a plan and then implementing that plan. Developers can dive into any of these steps to get a granular view of the suggested code and changes and delete, re-run or re-order the steps as necessary.
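
As a way to picture that flow, here's a hypothetical sketch of a session's shape. The structure and field names are invented for illustration; GitHub hasn't published Workspace's internal data model.

```python
# Hypothetical sketch of the specification -> plan -> implementation flow.
# The structure and field names are invented; this is not GitHub's actual API.
from dataclasses import dataclass, field

@dataclass
class Step:
    description: str                      # e.g. "Document the new helper functions"
    files: list[str] = field(default_factory=list)

@dataclass
class Session:
    task: str                             # the natural-language request
    specification: str = ""               # what "done" should mean for this task
    plan: list[Step] = field(default_factory=list)
    suggested_changes: dict[str, str] = field(default_factory=dict)  # file -> diff

session = Session(task="Add documentation for the changes in this pull request")
session.specification = "Every public function touched by the PR gets a docstring."
session.plan.append(Step("Document new helpers", files=["utils/io.py"]))
# A developer could edit, reorder, delete or re-run steps before code is generated.
```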

“If you ask any developer where they tend to get stuck with a new project, you’ll often hear them say it’s knowing where to start,” Carter said. “Copilot Workspace lifts that burden and gives developers a plan to start iterating from.”

Image Credits: GitHub

Workspace enters technical preview on Monday, optimized for a range of devices including mobile.

Importantly, because it’s in preview, Workspace isn’t covered by GitHub’s IP indemnification policy, which promises to assist with the legal fees of customers facing third-party claims alleging that the AI-generated code they’re using infringes on IP. (Generative AI models notoriously regurgitate their training data sets, and GPT-4 Turbo was trained partly on copyrighted code.)

GitHub says that it hasn’t determined how it’s going to productize Workspace, but that it’ll use the preview to “learn more about the value it delivers and how developers use it.”

I think the more important question is: Will Workspace fix the existential issues surrounding Copilot and other AI-powered coding tools?

An analysis of over 150 million lines of code committed to project repos over the past several years by GitClear, the developer of the code analysis tool of the same name, found that Copilot was resulting in more mistaken code being pushed to codebases and more code being re-added as opposed to reused and streamlined, creating headaches for code maintainers.

Elsewhere, security researchers have warned that Copilot and similar tools can amplify existing bugs and security issues in software projects. And Stanford researchers have found that developers who accept suggestions from AI-powered coding assistants tend to produce less secure code. (GitHub stressed to me that it uses an AI-based vulnerability prevention system to try to block insecure code in addition to an optional code duplication filter to detect regurgitations of public code.)

Yet devs aren’t shying away from AI.

In a StackOverflow poll from June 2023, 44% of developers said that they use AI tools in their development process now, and 26% plan to soon. Gartner predicts that 75% of enterprise software engineers will employ AI code assistants by 2028.

By emphasizing human review, perhaps Workspace can indeed help clean up some of the mess introduced by AI-generated code. We’ll find out soon enough as Workspace makes its way into developers’ hands.

“Our primary goal with Copilot Workspace is to leverage AI to reduce complexity so developers can express their creativity and explore more freely,” Carter said. “We truly believe the combination of human plus AI is always going to be superior to one or the other alone, and that’s what we’re betting on with Copilot Workspace.”



How RPA vendors aim to remain relevant in a world of AI agents | TechCrunch


What’s the next big thing in enterprise automation? If you ask the tech giants, it’s agents — driven by generative AI.

There’s no universally accepted definition of agent, but these days the term is used to describe generative AI-powered tools that can perform complex tasks through human-like interactions across software and web platforms.

For example, an agent could create an itinerary by filling in a customer’s info on airlines’ and hotel chains’ websites. Or an agent could order the least expensive ride-hailing service to a location by automatically comparing prices across apps.
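
Stripped to its core, that second example is a gather, compare and act loop. A toy sketch follows, with placeholder functions standing in for the real app integrations; none of it is a real API.

```python
# Toy sketch of the ride-hailing example: gather quotes, compare, act.
# Every function here is an invented placeholder, not a real integration.
def quote_price(app: str, destination: str) -> float:
    """Stand-in for an agent driving a real app or API to fetch a fare quote."""
    fake_quotes = {"RideA": 18.40, "RideB": 16.75, "RideC": 21.10}
    return fake_quotes[app]

def book_ride(app: str, destination: str) -> str:
    return f"Booked {app} to {destination}"

def cheapest_ride_agent(destination: str, apps: list[str]) -> str:
    quotes = {app: quote_price(app, destination) for app in apps}
    cheapest = min(quotes, key=quotes.get)   # compare prices across apps
    return book_ride(cheapest, destination)  # then take the action

print(cheapest_ride_agent("airport", ["RideA", "RideB", "RideC"]))  # Booked RideB ...
```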

Vendors sense opportunity. ChatGPT maker OpenAI is reportedly deep into developing AI agent systems. And Google demoed a slew of agent-like products at its annual Cloud Next conference in early April.

“Companies should start preparing for wide-scale adoption of autonomous agents today,” analysts at Boston Consulting Group wrote recently in a report — citing experts who estimate that autonomous agents will go mainstream in three to five years.

Old-school automation

So where does that leave RPA?

Robotic process automation (RPA) came into vogue over a decade ago as enterprises turned to the tech to bolster their digital transformation efforts while reducing costs. Like an agent, RPA drives workflow automation. But it’s a much more rigid form, based on “if-then” preset rules for processes that can be broken down into strictly defined, discretized steps.

“RPA can mimic human actions, such as clicking, typing or copying and pasting, to perform tasks faster and more accurately than humans,” Saikat Ray, VP analyst at Gartner, explained to TechCrunch in an interview. “However, RPA bots have limitations when it comes to handling complex, creative or dynamic tasks that require natural language processing or reasoning skills.”
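
To make that rigidity concrete, here's an invented example of the kind of preset "if-then" step an RPA bot encodes; anything outside the expected layout simply breaks it.

```python
# Invented example of an RPA-style rule: a fixed "if-then" sequence over strictly
# defined steps. Change the invoice layout and the automation breaks.
def process_invoice(invoice: dict) -> str:
    # Step 1: the fields must exist exactly where the rule expects them.
    if "vendor" not in invoice or "total" not in invoice:
        raise RuntimeError("Broken automation: unexpected invoice layout")
    # Step 2: preset branching, with no reasoning about unfamiliar cases.
    if invoice["total"] > 10_000:
        return "route_to_manager"
    return "auto_approve"

print(process_invoice({"vendor": "Acme", "total": 12_500}))  # route_to_manager
```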

This rigidity makes RPA expensive to build — and considerably limits its applicability.

A 2022 survey from Robocorp, an RPA vendor, finds that of the organizations that say they’ve adopted RPA, 69% experience broken automation workflows at least once a week — many of which take hours to fix. Entire businesses have been made out of helping enterprises manage their RPA installations and prevent them from breaking.

RPA vendors aren’t naive. They’re well aware of the challenges — and believe that generative AI could solve many of them without hastening their platforms’ demise. In RPA vendors’ minds, RPA and generative AI-powered agents can peacefully co-exist — and perhaps one day even grow to complement each other.

Generative AI automation

UiPath, one of the larger players in the RPA market with an estimated 10,000+ customers, including Uber, Xerox and CrowdStrike, recently announced new generative AI features focused on document and message processing, as well as taking automated actions to deliver what UiPath CEO Bob Enslin calls “one-click digital transformation.”

“These features provide customers generative AI models that are trained for their specific tasks,” Enslin told TechCrunch. “Our generative AI powers workloads such as text completion for emails, categorization, image detection, language translation, the ability to filter out personally identifiable information [and] quickly answering any people-topic-related questions based off of knowledge from internal data.”

One of UiPath’s more recent explorations in the generative AI domain is Clipboard AI, which combines UiPath’s platform with third-party models from OpenAI, Google and others to — as Enslin puts it — “bring the power of automation to anyone that has to copy/paste.” Clipboard AI lets users highlight data from a form, and — leveraging generative AI to figure out the right places for the copied data to go — point it to another form, app, spreadsheet or database.

Image Credits: UiPath

“UiPath sees the need to bring action and AI together; this is where value is created,” Enslin said. “We believe the best performance will come from those that combine generative AI and human judgment — what we call human-in-the-loop — across end-to-end processes.”

Automation Anywhere, UiPath’s main rival, is also attempting to fold generative AI into its RPA technologies.

Last year, Automation Anywhere launched generative AI-powered tools to create workflows from natural language, summarize content, extract data from documents and — perhaps most significantly — adapt to changes in apps that would normally cause an RPA automation to fail.

“[Our generative AI models are] developed on top of [open] large language models and trained with anonymized metadata from more than 150 million automation processes across thousands of enterprise applications,” Peter White, SVP of enterprise AI and automation at Automation Anywhere, told TechCrunch. “We continue to build custom machine learning models for specific tasks within our platform and are also now building customized models on top of foundational generative AI models using our automation datasets.”

Next-gen RPA

Ray notes it’s important to be cognizant of generative AI’s limitations — namely biases and hallucinations — as it powers a growing number of RPA capabilities. But, risks aside, he believes generative AI stands to add value to RPA by transforming the way these platforms work and “creating new possibilities for automation.”

“Generative AI is a powerful technology that can enhance the capabilities of RPA platforms enabling them to understand and generate natural language, automate content creation, improve decision-making and even generate code,” Ray said. “By integrating generative AI models, RPA platforms can offer more value to their customers, increase their productivity and efficiency and expand their use cases and applications.”

Craig Le Clair, principal analyst at Forrester, sees RPA platforms as being ripe for expansion to support autonomous agents and generative AI as their use cases grow. In fact, he anticipates RPA platforms morphing into all-around toolsets for automation — toolsets that help deploy RPA in addition to related generative AI technologies.

“RPA platforms have the architecture to manage thousands of task automations and this bodes well for central management of AI agents,” he said. “Thousands of companies are well established with RPA platforms and will be open to using them for generative AI-infused agents. RPA has grown in part thanks to its ability to integrate easily with existing work patterns, through UI integration, and this will remain valuable for more intelligent agents going forward.”

UiPath is already beginning to take steps in this direction with a new capability, Context Grounding, that entered preview earlier in the month. As Enslin explained it to me, Context Grounding is designed to improve the accuracy of generative AI models — both first- and third-party — by converting business data those models might draw on into an “optimized” format that’s easier to index and search.

“Context Grounding extracts information from company-specific datasets, like a knowledge base or internal policies and procedures, to create more accurate and insightful responses,” Enslin said.
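
UiPath hasn't detailed the mechanics, but the pattern it describes (index company data, retrieve what's relevant, hand it to the model) resembles retrieval-augmented grounding. Here's a generic sketch under that assumption, with a toy word-overlap score standing in for a real embedding model; it is not UiPath's implementation.

```python
# Generic retrieval-grounding sketch (not UiPath's implementation): index company
# documents, retrieve the most relevant for a question, and prepend them to the
# prompt so the model answers from that context. A word-overlap score stands in
# for a real embedding model.
def embed(text: str) -> set[str]:
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / max(len(a | b), 1)

documents = [
    "Expense reports over $500 require director approval.",
    "Vacation requests must be filed two weeks in advance.",
]
index = [(doc, embed(doc)) for doc in documents]

def grounded_prompt(question: str, top_k: int = 1) -> str:
    query = embed(question)
    ranked = sorted(index, key=lambda item: similarity(query, item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(grounded_prompt("Who approves a $700 expense report?"))
```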

If there’s anything holding RPA vendors back, it’s the ever-present temptation to lock customers in, Le Clair said. He stressed the need for platforms to “remain agnostic” and offer tools that can be configured to work with a range of current — and future — enterprise systems and workflows.

To that, Enslin pledged that UiPath will remain “open, flexible and responsible.”

“The future of AI will require a combination of specialized AI with generative AI,” he continued. “We want customers to be able to confidently use all kinds of AI.”

White didn’t commit to neutrality exactly. But he emphasized that Automation Anywhere’s roadmap is being heavily shaped by customer feedback.

“What we hear from every customer, across every industry, is that their ability to incorporate automation in many more use cases has increased exponentially with generative AI,” he said. “With generative AI infused into intelligent automation technologies like RPA, we see the potential for organizations to reduce operating costs and increase productivity. Companies who fail to adopt these technologies will struggle to compete against others who embrace generative AI and automation.”



Creators of Sora-powered short explain AI-generated video's strengths and limitations | TechCrunch


OpenAI’s video generation tool Sora took the AI community by surprise in February with fluid, realistic video that seems miles ahead of competitors. But the carefully stage-managed debut left out a lot of details — details that have been filled in by a filmmaker given early access to create a short using Sora.

Shy Kids is a digital production team based in Toronto that was picked by OpenAI as one of a few to produce short films essentially for OpenAI promotional purposes, though they were given considerable creative freedom in creating “air head.” In an interview with visual effects news outlet fxguide, post-production artist Patrick Cederberg described “actually using Sora” as part of his work.

Perhaps the most important takeaway for most is simply this: While OpenAI’s post highlighting the shorts lets the reader assume they more or less emerged fully formed from Sora, the reality is that these were professional productions, complete with robust storyboarding, editing, color correction, and post work like rotoscoping and VFX. Just as Apple says “shot on iPhone” but doesn’t show the studio setup, professional lighting, and color work after the fact, the Sora post only talks about what it lets people do, not how they actually did it.

Cederberg’s interview is interesting and quite non-technical, so if you’re interested at all, head over to fxguide and read it. But here are some interesting nuggets about using Sora that tell us that, as impressive as it is, the model is perhaps less of a giant leap forward than we thought.

Control is still the thing that is the most desirable and also the most elusive at this point. … The closest we could get was just being hyper-descriptive in our prompts. Explaining wardrobe for characters, as well as the type of balloon, was our way around consistency because shot to shot / generation to generation, there isn’t the feature set in place yet for full control over consistency.

In other words, matters that are simple in traditional filmmaking, like choosing the color of a character’s clothing, take elaborate workarounds and checks in a generative system, because each shot is created independent of the others. That could obviously change, but it is certainly much more laborious at the moment.

Sora outputs had to be watched for unwanted elements as well: Cederberg described how the model would routinely generate a face on the balloon that the main character has for a head, or a string hanging down the front. These had to be removed in post, another time-consuming process, if they couldn’t get the prompt to exclude them.

Precise timing and movements of characters or the camera aren’t really possible: “There’s a little bit of temporal control about where these different actions happen in the actual generation, but it’s not precise … it’s kind of a shot in the dark,” said Cederberg.

For example, timing a gesture like a wave is a very approximate, suggestion-driven process, unlike manual animations. And a shot like a pan upward on the character’s body may or may not reflect what the filmmaker wants — so the team in this case rendered a shot composed in portrait orientation and did a crop pan in post. The generated clips were also often in slow motion for no particular reason.

Example of a shot as it came out of Sora and how it ended up in the short. Image Credits: Shy Kids

In fact, results from using the everyday language of filmmaking, like “panning right” or “tracking shot,” were inconsistent in general, Cederberg said, which the team found pretty surprising.

“The researchers, before they approached artists to play with the tool, hadn’t really been thinking like filmmakers,” he said.

As a result, the team did hundreds of generations, each 10 to 20 seconds, and ended up using only a handful. Cederberg estimated the ratio at 300:1 — but of course we would probably all be surprised at the ratio on an ordinary shoot.

The team actually did a little behind-the-scenes video explaining some of the issues they ran into, if you’re curious. Like a lot of AI-adjacent content, the comments are pretty critical of the whole endeavor — though not quite as vituperative as the AI-assisted ad we saw pilloried recently.

The last interesting wrinkle pertains to copyright: If you ask Sora to give you a “Star Wars” clip, it will refuse. And if you try to get around it with “robed man with a laser sword on a retro-futuristic spaceship,” it will also refuse, as by some mechanism it recognizes what you’re trying to do. It also refused to do an “Aronofsky type shot” or a “Hitchcock zoom.”

On one hand, it makes perfect sense. But it does prompt the question: If Sora knows what these are, does that mean the model was trained on that content, the better to recognize that it is infringing? OpenAI, which keeps its training data cards close to the vest — to the point of absurdity, as with CTO Mira Murati’s interview with Joanna Stern — will almost certainly never tell us.

As for Sora and its use in filmmaking, it’s clearly a powerful and useful tool in its place, but its place is not “creating films out of whole cloth.” Yet. As another villain once famously said, “that comes later.”





BigPanda launches generative AI tool designed specifically for ITOps | TechCrunch


IT operations personnel have a lot going on, and when an incident occurs that brings down a key system, time is always going to be against them. Over the years, companies have looked for an edge in getting systems back up faster, with playbooks designed to find answers to common problems and postmortems to keep those problems from repeating. But not every problem is easily solved, and there is so much data and so many possible points of failure.

It’s actually a perfect problem for generative AI to solve, and AIOps startup BigPanda announced a new generative AI tool today called Biggy to help solve some of these issues faster. Biggy is designed to look across a wide variety of IT-related data to learn how the company operates, compare that knowledge against the current problem scenario and similar past ones, and suggest a solution.

BigPanda has been using AI since the early days of the company and deliberately designed two separate systems: one for the data layer and another for the AI. This in a way prepared them for this shift to generative AI based on large language models. “The AI engine before Gen AI was building a lot of other types of AI, but it was feeding off of the same data engine that will be feeding what we’re doing with Biggy, and what we’re doing with generative and conversational AI,” BigPanda CEO Assaf Resnick told TechCrunch.

Like most generative AI tools, this one makes a prompt box available where users can ask questions and interact with the bot. In this case, the underlying models have been trained on data inside the customer company, as well as on publicly available data on a particular piece of hardware or software, and are tuned to deal with the kinds of problems IT deals with on a regular basis.

“The out-of-the box LLMs have been trained on a huge amount of data, and they’re really good actually as generalists in all of the operational fields we look at — infrastructure, network, application development, everything there. And they actually know all the hardware very well,” Jason Walker, chief innovation officer at BigPanda, said. “So if you ask it about a certain HP blade server with this error code, it’s pretty good at putting that together, and we use that for a lot of the event traffic.” Of course, it has to be more than that or a human engineer could simply look this up in Google Search.

It combines this knowledge with what it is able to cull internally across a range of data types. “BigPanda ingests the customer’s operational and contextual data from observability, change, CMDB (the configuration management database) and topology along with historical data and human, institutional context — and normalizes the data into key-value pairs, or tags,” Walker said. That’s a lot of technical jargon, but basically it means it looks at system-level information, organizational data and human interactions to deliver a response to help engineers solve the problem.
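
BigPanda hasn't published its schema, but here's a rough illustration of what normalizing heterogeneous ops data into flat key-value tags can look like; the field names and records are invented.

```python
# Rough illustration (not BigPanda's actual schema) of normalizing heterogeneous
# operational records into flat key-value tags a single engine can correlate.
def normalize(record: dict, source: str) -> dict:
    tags = {"source": source}
    for key, value in record.items():
        tags[key.lower().replace(" ", "_")] = str(value)
    return tags

observability_alert = {"Host": "db-07", "Check": "disk_usage", "Severity": "critical"}
change_record = {"Ticket": "CHG-1042", "Component": "db-07", "Action": "kernel upgrade"}

print(normalize(observability_alert, "monitoring"))
print(normalize(change_record, "change_management"))
# Shared values (here, db-07) are what let an assistant tie the alert to the change.
```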

When a user enters a prompt, it looks across all the data to generate an answer that will hopefully point the engineers in the right direction to fix the problem. They acknowledge that it’s not always perfect because no generative AI is, but they let the user know when there is a lower degree of certainty that the answer is correct.

“For areas where we think we don’t have as much certainty, then we tell them that this is our best information, but a human should take a look at this,” Resnick said. For other areas where there is more certainty, they may introduce automation, working with a tool like Red Hat Ansible to solve the issue without human interaction, he said.

The data ingestion part isn’t always going to be trivial for customers, and this is a first step toward providing an AI assistant that can help IT get at the root of problems and solve them faster. No AI is foolproof, but having an interactive AI tool should be an improvement over current, more time-consuming manual approaches to IT systems troubleshooting.



Exclusive: Eric Schmidt-backed Augment, a GitHub Copilot rival, launches out of stealth with $252M


AI is supercharging coding — and developers are embracing it.

In a recent StackOverflow poll, 44% of software engineers said that they use AI tools as part of their development processes now and 26% plan to soon. Gartner estimates that over half of organizations are currently piloting or have already deployed AI-driven coding assistants, and that 75% of developers will use coding assistants in some form by 2028.

Ex-Microsoft software developer Igor Ostrovsky believes that soon, there won’t be a developer who doesn’t use AI in their workflows. “Software engineering remains a difficult and all-too-often tedious and frustrating job, particularly at scale,” he told TechCrunch. “AI can improve software quality, team productivity and help restore the joy of programming.”

So Ostrovsky decided to build the AI-powered coding platform that he himself would want to use.

That platform is Augment, and on Wednesday it emerged from stealth with $252 million in funding at a near-unicorn ($977 million) post-money valuation. With investments from former Google CEO Eric Schmidt and VCs including Index Ventures, Sutter Hill Ventures, Lightspeed Venture Partners, Innovation Endeavors and Meritech Capital, Augment aims to shake up the still-nascent market for generative AI coding technologies.

“Most companies are dissatisfied with the programs they produce and consume; software is too often fragile, complex and expensive to maintain with development teams bogged down with long backlogs for feature requests, bug fixes, security patches, integration requests, migrations and upgrades,” Ostrovsky said. “Augment has both the best team and recipe for empowering programmers and their organizations to deliver high-quality software quicker.”

Ostrovsky spent nearly seven years at Microsoft before joining Pure Storage, a startup developing flash data storage hardware and software products, as a founding engineer. While at Microsoft, Ostrovsky worked on components of Midori, a next-generation operating system the company never released but whose concepts have made their way into other Microsoft projects over the last decade.

In 2022, Ostrovsky and Guy Gur-Ari, previously an AI research scientist at Google, teamed up to create Augment’s MVP. To fill out the startup’s executive ranks, Ostrovsky and Gur-Ari brought on Scott Dietzen, ex-CEO of Pure Storage, and Dion Almaer, formerly a Google engineering director and a VP of engineering at Shopify.

Augment remains a strangely hush-hush operation.

In our conversation, Ostrovsky wasn’t willing to say much about the user experience or even the generative AI models driving Augment’s features (whatever they may be) — save that Augment is using fine-tuned “industry-leading” open models of some sort.

He did say how Augment plans to make money: standard software-as-a-service subscriptions. Pricing and other details will be revealed later this year, Ostrovsky added, closer to Augment’s planned GA release.

“Our funding provides many years of runway to continue to build what we believe to be the best team in enterprise AI,” he said. “We’re accelerating product development and building out Augment’s product, engineering and go-to-market functions as the company gears up for rapid growth.”

Rapid growth is perhaps the best shot Augment has at making waves in an increasingly cutthroat industry.

Practically every tech giant offers its own version of an AI coding assistant. Microsoft has GitHub Copilot, which is by far the firmest entrenched with over 1.3 million paying individual and 50,000 enterprise customers as of February. Amazon has AWS’ CodeWhisperer. And Google has Gemini Code Assist, recently rebranded from Duet AI for Developers.

Elsewhere, there’s a torrent of coding assistant startups: Magic, Tabnine, Codegen, Refact, TabbyML, Sweep, Laredo and Cognition (which reportedly just raised $175 million), to name a few. Harness and JetBrains, which developed the Kotlin programming language, recently released their own. So did Sentry (albeit with more of a cybersecurity bent).

Can they all — plus Augment now — do business harmoniously together? It seems unlikely. Eye-watering compute costs alone make the AI coding assistant business a challenging one to maintain. Overruns related to training and serving models forced generative AI coding startup Kite to shut down in December 2022. Even Copilot loses money, to the tune of around $20 to $80 a month per user, according to The Wall Street Journal.

Ostrovsky implies that there’s momentum behind Augment already; he claims that “hundreds” of software developers across “dozens” of companies including payment startup Keeta (which is also Eric Schmidt-backed) are using Augment in early access. But will the uptake sustain? That’s the million-dollar question, indeed.

I also wonder if Augment has made any steps toward solving the technical setbacks plaguing code-generating AI, particularly around vulnerabilities.

An analysis by GitClear, the developer of the code analytics tool of the same name, found that coding assistants are resulting in more mistaken code being pushed to codebases, creating headaches for software maintainers. Security researchers have warned that generative coding tools can amplify existing bugs and exploits in projects. And Stanford researchers have found that developers who accept code recommendations from AI assistants tend to produce less secure code.

Then there’s copyright to worry about.

Augment’s models were undoubtedly trained on publicly available data, like all generative AI models — some of which may’ve been copyrighted or under a restrictive license. Some vendors have argued that fair use doctrine shields them from copyright claims while at the same time rolling out tools to mitigate potential infringement. But that hasn’t stopped coders from filing class action lawsuits over what they allege are open licensing and IP violations.

To all this, Ostrovsky says: “Current AI coding assistants don’t adequately understand the programmer’s intent, improve software quality nor facilitate team productivity, and they don’t properly protect intellectual property. Augment’s engineering team boasts deep AI and systems expertise. We’re poised to bring AI coding assistance innovations to developers and software teams.”

Augment, which is based in Palo Alto, has around 50 employees; Ostrovsky expects that number to double by the end of the year.



Snowflake releases a flagship generative AI model of its own | TechCrunch


All-around, highly generalizable generative AI models were the name of the game once, and they arguably still are. But increasingly, as cloud vendors large and small join the generative AI fray, we’re seeing a new crop of models focused on the deepest-pocketed potential customers: the enterprise.

Case in point: Snowflake, the cloud computing company, today unveiled Arctic LLM, a generative AI model that’s described as “enterprise-grade.” Available under an Apache 2.0 license, Arctic LLM is optimized for “enterprise workloads,” including generating database code, Snowflake says, and is free for research and commercial use.

“I think this is going to be the foundation that’s going to let us — Snowflake — and our customers build enterprise-grade products and actually begin to realize the promise and value of AI,” CEO Sridhar Ramaswamy said in a press briefing. “You should think of this very much as our first, but big, step in the world of generative AI, with lots more to come.”

An enterprise model

My colleague Devin Coldewey recently wrote about how there’s no end in sight to the onslaught of generative AI models. I recommend you read his piece, but the gist is: Models are an easy way for vendors to drum up excitement for their R&D and they also serve as a funnel to their product ecosystems (e.g., model hosting, fine-tuning and so on).

Arctic LLM is no different. Snowflake’s flagship model in a family of generative AI models called Arctic, Arctic LLM — which took around three months, 1,000 GPUs and $2 million to train — arrives on the heels of Databricks’ DBRX, a generative AI model also marketed as optimized for the enterprise space.

Snowflake draws a direct comparison between Arctic LLM and DBRX in its press materials, saying Arctic LLM outperforms DBRX on the two tasks of coding (Snowflake didn’t specify which programming languages) and SQL generation. The company said Arctic LLM is also better at those tasks than Meta’s Llama 2 70B (but not the more recent Llama 3 70B) and Mistral’s Mixtral-8x7B.

Snowflake also claims that Arctic LLM achieves “leading performance” on a popular general language understanding benchmark, MMLU. I’ll note, though, that while MMLU purports to evaluate generative models’ ability to reason through logic problems, it includes tests that can be solved through rote memorization, so take that bullet point with a grain of salt.

“Arctic LLM addresses specific needs within the enterprise sector,” Baris Gultekin, head of AI at Snowflake, told TechCrunch in an interview, “diverging from generic AI applications like composing poetry to focus on enterprise-oriented challenges, such as developing SQL co-pilots and high-quality chatbots.”

Arctic LLM, like DBRX and Google’s top-performing generative model of the moment, Gemini 1.5 Pro, is built on a mixture of experts (MoE) architecture. MoE architectures basically break down data processing tasks into subtasks and then delegate them to smaller, specialized “expert” models. So, while Arctic LLM contains 480 billion parameters, it only activates 17 billion at a time — enough to drive the 128 separate expert models. (Parameters essentially define the skill of an AI model on a problem, like analyzing and generating text.)
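
For readers who want the routing idea in code, here is a bare-bones numerical sketch of top-k expert routing. It illustrates the general MoE mechanism, not Arctic LLM's actual implementation, and the dimensions are tiny toy values.

```python
# Bare-bones mixture-of-experts routing (generic illustration, not Arctic LLM's
# code): a gate scores every expert for a token, only the top-k experts run, and
# their outputs are combined using the gate's renormalized scores.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

gate_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ gate_w                        # one routing score per expert
    active = np.argsort(scores)[-top_k:]       # only k experts are activated
    weights = np.exp(scores[active])
    weights /= weights.sum()
    # The other experts' parameters are never touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, active))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                  # (16,)
```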

Snowflake claims that this efficient design enabled it to train Arctic LLM on open public web data sets (including RefinedWeb, C4, RedPajama and StarCoder) at “roughly one-eighth the cost of similar models.”

Running everywhere

Snowflake is providing resources like coding templates and a list of training sources alongside Arctic LLM to guide users through the process of getting the model up and running and fine-tuning it for particular use cases. But, recognizing that those are likely to be costly and complex undertakings for most developers (fine-tuning or running Arctic LLM requires around eight GPUs), Snowflake’s also pledging to make Arctic LLM available across a range of hosts, including Hugging Face, Microsoft Azure, Together AI’s model-hosting service, and enterprise generative AI platform Lamini.

Here’s the rub, though: Arctic LLM will be available first on Cortex, Snowflake’s platform for building AI- and machine learning-powered apps and services. The company’s unsurprisingly pitching it as the preferred way to run Arctic LLM with “security,” “governance” and scalability.

“Our dream here is, within a year, to have an API that our customers can use so that business users can directly talk to data,” Ramaswamy said. “It would’ve been easy for us to say, ‘Oh, we’ll just wait for some open source model and we’ll use it.’ Instead, we’re making a foundational investment because we think [it’s] going to unlock more value for our customers.”

So I’m left wondering: Who’s Arctic LLM really for besides Snowflake customers?

In a landscape full of “open” generative models that can be fine-tuned for practically any purpose, Arctic LLM doesn’t stand out in any obvious way. Its architecture might bring efficiency gains over some of the other options out there. But I’m not convinced that they’ll be dramatic enough to sway enterprises away from the countless other well-known and -supported, business-friendly generative models (e.g. GPT-4).

There’s also a point in Arctic LLM’s disfavor to consider: its relatively small context.

In generative AI, context window refers to input data (e.g. text) that a model considers before generating output (e.g. more text). Models with small context windows are prone to forgetting the content of even very recent conversations, while models with larger contexts typically avoid this pitfall.

Arctic LLM’s context is between ~8,000 and ~24,000 words, dependent on the fine-tuning method — far below that of models like Anthropic’s Claude 3 Opus and Google’s Gemini 1.5 Pro.
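
A small window matters because the client has to trim history to fit it. Here's a minimal sketch of that trimming, with word counts standing in for tokens; it's an illustration of the constraint, not any vendor's code.

```python
# Minimal illustration of why context size matters: once a conversation exceeds
# the window, the oldest turns are dropped and the model can no longer see them.
# Word counts stand in for tokens to keep the example simple.
def fit_to_window(turns: list[str], window_words: int) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):               # keep the most recent turns first
        words = len(turn.split())
        if used + words > window_words:
            break
        kept.append(turn)
        used += words
    return list(reversed(kept))

history = ["user: my order id is 4417", "bot: thanks!", "user: where is my package?"]
print(fit_to_window(history, window_words=8))
# With a window this small, the order id from the first turn is already gone.
```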

Snowflake doesn’t mention it in the marketing, but Arctic LLM almost certainly suffers from the same limitations and shortcomings as other generative AI models — namely, hallucinations (i.e. confidently answering requests incorrectly). That’s because Arctic LLM, along with every other generative AI model in existence, is a statistical probability machine — one that, again, has a small context window. It guesses based on vast amounts of examples which data makes the most “sense” to place where (e.g. the word “go” before “the market” in the sentence “I go to the market”). It’ll inevitably guess wrong — and that’s a “hallucination.”

As Devin writes in his piece, until the next major technical breakthrough, incremental improvements are all we have to look forward to in the generative AI domain. That won’t stop vendors like Snowflake from championing them as great achievements, though, and marketing them for all they’re worth.



Amazon wants to host companies' custom generative AI models | TechCrunch


AWS, Amazon’s cloud computing business, wants to be the go-to place companies host and fine-tune their custom generative AI models.

Today, AWS announced the launch of Custom Model Import (in preview), a new feature in Bedrock, AWS’ enterprise-focused suite of generative AI services, that allows organizations to import and access their in-house generative AI models as fully managed APIs.

Companies’ proprietary models, once imported, benefit from the same infrastructure as other generative AI models in Bedrock’s library (e.g. Meta’s Llama 3, Anthropic’s Claude 3), including tools to expand their knowledge, fine-tune them and implement safeguards to mitigate their biases.

“There have been AWS customers that have been fine-tuning or building their own models outside of Bedrock using other tools,” Vasi Philomin, VP of generative AI at AWS, told TechCrunch in an interview. “This Custom Model Import capability allows them to bring their own proprietary models to Bedrock and see them right next to all of the other models that are already on Bedrock — and use them with all of the workflows that are also already on Bedrock, as well.”
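
In practice, an imported model surfaces through the same Bedrock runtime API as the built-in ones. Here's a sketch with boto3; the model ARN below is a placeholder, and the request body format depends on the model family being imported.

```python
# Sketch of invoking a model through the Bedrock runtime API with boto3. The
# model ARN below is a placeholder for an imported custom model, and the JSON
# body format depends on the model family being served.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="arn:aws:bedrock:us-east-1:123456789012:imported-model/EXAMPLE",  # placeholder
    body=json.dumps({"prompt": "Summarize last quarter's support tickets.", "max_tokens": 256}),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(response["body"].read()))
```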

Importing custom models

According to a recent poll from Cnvrg, Intel’s AI-focused subsidiary, the majority of enterprises are approaching generative AI by building their own models and refining them to their applications. Those same enterprises say that they see infrastructure, including cloud compute infrastructure, as their greatest barrier to deployment, per the poll.

With Custom Model Import, AWS aims to rush in to fill the need while maintaining pace with cloud rivals. (Amazon CEO Andy Jassy foreshadowed as much in his recent annual letter to shareholders.)

For some time, Vertex AI, Google’s analog to Bedrock, has allowed customers to upload generative AI models, tailor them and serve them through APIs. Databricks, too, has long provided toolsets to host and tweak custom models, including its own recently released DBRX.

Asked what sets Custom Model Import apart, Philomin asserted that it — and by extension Bedrock — offer a wider breadth and depth of model customization options than the competition, adding that “tens of thousands” of customers today are using Bedrock.

“Number one, Bedrock provides several ways for customers to deal with serving models,” Philomin said. “Number two, we have a whole bunch of workflows around these models — and now customers’ models can stand right next to all of the other models that we have already available. A key thing that most people like about this is the ability to be able to experiment across multiple different models using the same workflows, and then actually take them to production from the same place.”

So what are the alluded-to model customization options?

Philomin points to Guardrails, which lets Bedrock users configure thresholds to filter — or at least attempt to filter — models’ outputs for things like hate speech, violence and private personal or corporate information. (Generative AI models are notorious for going off the rails in problematic ways, including leaking sensitive info; AWS’ models have been no exception.) He also highlighted Model Evaluation, a Bedrock tool customers can use to test how well a model — or several — performs across a given set of criteria.
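
AWS hasn't published how those thresholds work internally. As a generic illustration of the idea (not the Bedrock Guardrails API), threshold-based output filtering amounts to something like this, with invented categories and hard-coded scores:

```python
# Generic illustration of threshold-based output filtering, not the Bedrock
# Guardrails API: each category has a configurable cutoff, and output whose
# classifier score crosses a cutoff is blocked before reaching the user.
THRESHOLDS = {"hate": 0.4, "violence": 0.5, "pii": 0.2}

def apply_guardrails(text: str, scores: dict[str, float]) -> str:
    for category, cutoff in THRESHOLDS.items():
        if scores.get(category, 0.0) >= cutoff:
            return f"[blocked: {category} score {scores[category]:.2f} >= {cutoff}]"
    return text

# The scores would come from moderation classifiers; they're hard-coded here.
print(apply_guardrails("draft model output ...", {"hate": 0.05, "violence": 0.10, "pii": 0.60}))
```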

Both Guardrails and Model Evaluation are now generally available following a several-months-long preview.

I feel compelled to note here that Custom Model Import only supports three model architectures at the moment — Hugging Face’s Flan-T5, Meta’s Llama and Mistral’s models — and that Vertex AI and other Bedrock-rivaling services, including Microsoft’s AI development tools on Azure, offer more or less comparable safety and evaluation features (see Azure AI Content Safety, model evaluation in Vertex and so on).

What is unique to Bedrock, though, are AWS’ Titan family of generative AI models. And — coinciding with the release of Custom Model Import — there’s several noteworthy developments on that front.

Upgraded Titan models

Titan Image Generator, AWS’ text-to-image model, is now generally available after launching in preview last November. As before, Titan Image Generator can create new images given a text description or customize existing images, for example swapping out an image background while retaining the subjects in the image.

Compared to the preview version, Titan Image Generator in GA can generate images with more “creativity,” said Philomin, without going into detail. (Your guess as to what that means is as good as mine.)
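For reference, a basic text-to-image request to Titan Image Generator through the Bedrock runtime looks roughly like the sketch below. The model ID and request schema are assumptions drawn from AWS’ published Titan documentation, so treat the details as a sketch rather than gospel.

```python
import base64
import json

import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Request schema assumed from the Titan Image Generator documentation.
request = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {"text": "A lighthouse on a rocky coast at sunrise"},
    "imageGenerationConfig": {
        "numberOfImages": 1,
        "height": 1024,
        "width": 1024,
        "cfgScale": 8.0,
    },
}

response = runtime.invoke_model(
    modelId="amazon.titan-image-generator-v1",  # assumed identifier; check your region
    body=json.dumps(request),
    contentType="application/json",
    accept="application/json",
)

# Titan returns base64-encoded images in an "images" list.
payload = json.loads(response["body"].read())
with open("lighthouse.png", "wb") as f:
    f.write(base64.b64decode(payload["images"][0]))
```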

I asked Philomin if he had any more details to share about how Titan Image Generator was trained.

At the model’s debut last November, AWS was vague about which data, exactly, it used in training Titan Image Generator. Few vendors readily reveal such information; they see training data as a competitive advantage and thus keep it and info relating to it close to the chest.

Training data details are also a potential source of IP-related lawsuits, another disincentive to reveal much. In several cases making their way through the courts, plaintiffs challenge vendors’ fair use defenses, arguing that text-to-image tools replicate artists’ styles without the artists’ explicit permission and let users generate new works resembling artists’ originals for which the artists receive no payment.

Philomin would only tell me that AWS uses a combination of first-party and licensed data.

“We have a combination of proprietary data sources, but also we license a lot of data,” he said. “We actually pay copyright owners licensing fees in order to be able to use their data, and we do have contracts with several of them.”

That’s more detail than AWS offered in November. But I have a feeling Philomin’s answer won’t satisfy everyone, particularly the content creators and AI ethicists arguing for greater transparency around generative AI model training.

In lieu of transparency, AWS says it’ll continue to offer an indemnification policy that covers customers in the event a Titan model like Titan Image Generator regurgitates (i.e. spits out a mirror copy of) a potentially copyrighted training example. (Several rivals, including Microsoft and Google, offer similar policies covering their image generation models.)

To address another pressing ethical threat — deepfakes — AWS says that images created with Titan Image Generator will, as during the preview, come with a “tamper-resistant” invisible watermark. Philomin says that the watermark has been made more resistant in the GA release to compression and other image edits and manipulations.

Segueing into less controversial territory, I asked Philomin whether AWS — like Google, OpenAI and others — is exploring video generation, given the excitement around (and investment in) the tech. Philomin didn’t say that AWS wasn’t… but he wouldn’t reveal any more than that.

“Obviously, we’re constantly looking to see what new capabilities customers want to have, and video generation definitely comes up in conversations with customers,” Philomin said. “I’d ask you to stay tuned.”

In one last piece of Titan-related news, AWS released the second generation of its Titan Embeddings model, Titan Text Embeddings V2. Titan Text Embeddings V2 converts text to numerical representations called embeddings to power search and personalization applications. So did the first-generation Embeddings model — but AWS claims that Titan Text Embeddings V2 is overall more efficient, cost-effective and accurate.

“What the Embeddings V2 model does is reduce the overall storage [necessary to use the model] by up to four times while retaining 97% of the accuracy,” Philomin claimed, “outperforming other models that are comparable.”

We’ll see if real-world testing bears that out.
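For readers who want to run that comparison themselves, here’s a minimal sketch of calling Titan Text Embeddings V2 through the Bedrock runtime. The model ID and the dimensions/normalize options are assumptions based on AWS’ published docs and may vary by region.

```python
import json

import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text: str, dimensions: int = 512) -> list[float]:
    """Return an embedding for `text` from Titan Text Embeddings V2.
    Requesting fewer dimensions trades a little accuracy for less storage,
    which is the efficiency claim AWS is making for V2."""
    response = runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed identifier; check your region
        body=json.dumps({
            "inputText": text,
            "dimensions": dimensions,  # assumed option: 256, 512 or 1024
            "normalize": True,
        }),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


vector = embed("How do I reset my password?")
print(len(vector))  # should match the requested dimensions
```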


Adobe claims its new image generation model is its best yet | TechCrunch


Firefly, Adobe’s family of generative AI models, doesn’t have the best reputation among creatives.

The Firefly image generation model in particular has been derided as underwhelming and flawed compared to Midjourney, OpenAI’s DALL-E 3 and other rivals, with a tendency to distort limbs and landscapes and to miss the nuances in prompts. But Adobe is trying to right the ship with its third-generation model, Firefly Image 3, which it’s releasing this week during its Max London conference.

The model, now available in Photoshop (beta) and Adobe’s Firefly web app, produces more “realistic” imagery than its predecessor (Image 2) and its predecessor’s predecessor (Image 1) thanks to an ability to understand longer, more complex prompts and scenes as well as improved lighting and text generation capabilities. It should more accurately render things like typography, iconography, raster images and line art, says Adobe, and is “significantly” more adept at depicting dense crowds and people with “detailed features” and “a variety of moods and expressions.”

For what it’s worth, in my brief, unscientific comparison, Image 3 does appear to be a step up from Image 2.

I wasn’t able to try Image 3 myself. But Adobe PR sent a few outputs and prompts from the model, and I ran those same prompts through Image 2 on the web to generate samples for comparison. (Keep in mind that the Image 3 outputs could’ve been cherry-picked.)

Notice the lighting in this headshot from Image 3 compared to the one below it, from Image 2:

From Image 3. Prompt: “Studio portrait of young woman.”

Same prompt as above, from Image 2.

The Image 3 output looks more detailed and lifelike to my eyes, with shadowing and contrast that’s largely absent from the Image 2 sample.

Here’s a set of images showing Image 3’s scene understanding at play:

From Image 3. Prompt: “An artist in her studio sitting at desk looking pensive with tons of paintings and ethereal.”

Same prompt as above. From Image 2.

Note that the Image 2 sample is fairly basic compared to the output from Image 3 in terms of level of detail and overall expressiveness. There’s some wonkiness in the Image 3 subject’s shirt (around the waist), but the pose is more complex than the subject’s in the Image 2 sample. (And Image 2’s clothes are also a bit off.)

Some of Image 3’s improvements can no doubt be traced to a larger and more diverse training data set.

Like Image 2 and Image 1, Image 3 is trained on uploads to Adobe Stock, Adobe’s royalty-free media library, along with licensed content and public domain content whose copyright has expired. Adobe Stock grows all the time, and consequently so too does the available training data set.

In an effort to ward off lawsuits and position itself as a more “ethical” alternative to generative AI vendors who train on images indiscriminately (e.g. OpenAI, Midjourney), Adobe has a program to pay Adobe Stock contributors to the training data set. (We’ll note that the terms of the program are rather opaque, though.) Controversially, Adobe also trains Firefly models on AI-generated images, which some consider a form of data laundering.

Recent Bloomberg reporting revealed AI-generated images in Adobe Stock aren’t excluded from Firefly image-generating models’ training data, a troubling prospect considering those images might contain regurgitated copyrighted material. Adobe has defended the practice, claiming that AI-generated images make up only a small portion of its training data and go through a moderation process to ensure they don’t depict trademarks or recognizable characters or reference artists’ names.

Of course, neither diverse, more “ethically” sourced training data nor content filters and other safeguards guarantee a perfectly flaw-free experience — see users generating people flipping the bird with Image 2. The real test of Image 3 will come once the community gets its hands on it.

New AI-powered features

Image 3 powers several new features in Photoshop beyond enhanced text-to-image.

A new “style engine” in Image 3, along with a new auto-stylization toggle, allows the model to generate a wider array of colors, backgrounds and subject poses. They feed into Reference Image, an option that lets users condition the model on an image whose colors or tone they want their future generated content to align with.

Three new generative tools — Generate Background, Generate Similar and Enhance Detail — leverage Image 3 to perform precision edits on images. The (self-descriptive) Generate Background replaces a background with a generated one that blends into the existing image, while Generate Similar offers variations on a selected portion of a photo (a person or an object, for example). As for Enhance Detail, it “fine-tunes” images to improve sharpness and clarity.

If these features sound familiar, that’s because they’ve been in beta in the Firefly web app for at least a month (and Midjourney for much longer than that). This marks their Photoshop debut — in beta.

Speaking of the web app, Adobe isn’t neglecting this alternate route to its AI tools.

To coincide with the release of Image 3, the Firefly web app is getting Structure Reference and Style Reference, which Adobe’s pitching as new ways to “advance creative control.” (Both were announced in March, but they’re now becoming widely available.) With Structure Reference, users can generate new images that match the “structure” of a reference image — say, a head-on view of a race car. Style Reference is essentially style transfer by another name, preserving the content of an image (e.g. elephants on an African safari) while mimicking the style (e.g. pencil sketch) of a target image.

Here’s Structure Reference in action:

Original image.

Transformed with Structure Reference.

And Style Reference:

Original image.

Transformed with Style Reference.

I asked Adobe if, with all the upgrades, Firefly image generation pricing would change. Currently, the cheapest Firefly premium plan is $4.99 per month — undercutting competition like Midjourney ($10 per month) and OpenAI (which gates DALL-E 3 behind a $20-per-month ChatGPT Plus subscription).

Adobe said that its current tiers will remain in place for now, along with its generative credit system. It also said that its indemnity policy, which states Adobe will pay copyright claims related to works generated in Firefly, won’t be changing either, nor will its approach to watermarking AI-generated content. Content Credentials — metadata to identify AI-generated media — will continue to be automatically attached to all Firefly image generations on the web and in Photoshop, whether generated from scratch or partially edited using generative features.



