Biodiversity AI Datasets

Overview

The current generative AI boom, driven by large language models (LLMs), was made possible by training on vast amounts of text from the web. In the domain of living organisms and the natural world, however, no training dataset of comparable quality and scale exists yet. IKIMON aims to do more than run an app: its goal is to build the "world's highest quality organism image dataset" and, on top of it, a biodiversity-specialized foundation model developed in Japan (Large Nature Model: LNM). We see this as essential infrastructure for a world where the children of the future can carry a "pocket expert" with nothing more than a smartphone.

Current Challenges: Why is Current AI Weak on Organisms?

1. The "Long Tail" Wall (Data Bias)

Even in the iNaturalist dataset, one of the largest in the world, there are tens of thousands of images of "common species" such as sparrows and cabbage white butterflies, but only a handful of images (or none) of endangered species and obscure insects. A model learns well from classes with abundant data and poorly from classes with little, so the rarer a species is, the more likely it is to be ignored or misidentified. For conservation, this is exactly the outcome most to be avoided.
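
One common partial mitigation for this kind of imbalance is to sample rare species more often during training. The sketch below is only an illustration under stated assumptions: the function name, the `power` parameter, and the toy labels are hypothetical and are not taken from IKIMON's pipeline. Reweighting also cannot help a species that has zero images.

```python
from collections import Counter

def inverse_frequency_weights(labels, power=0.5):
    """Return one sampling weight per image, proportional to count ** -power.

    power=0.0 keeps the original (imbalanced) distribution;
    power=1.0 gives every species the same total sampling mass.
    """
    counts = Counter(labels)
    raw = [counts[label] ** -power for label in labels]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical toy data: nine images of a common species, one of a rare one.
labels = ["sparrow"] * 9 + ["rare_beetle"]
weights = inverse_frequency_weights(labels, power=1.0)

# With power=1.0 the single rare image is drawn as often as all nine common ones.
print(round(weights[-1], 3), round(sum(weights[:9]), 3))
```

Intermediate values of `power` trade off between fidelity to the real-world distribution and attention to rare species; the right value is an empirical question.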

2. The "Misidentification Reproduction" Cycle

Many apps have introduced "AI automatic suggestions," but beginners frequently register the suggestion as-is, reasoning that "if the AI says so, it must be right."
  • Incorrect data is registered as "correct" in the DB
  • The next AI learns from that incorrect data
  • Misidentification is reinforced, and nobody is in a position to notice the mistakes
This vicious cycle (a self-reinforcing feedback loop) is a major challenge in the current biodiversity AI world.

3. Lack of Context

Many current image recognition AIs are trained on "clean photos of adults." Actual nature observation, however, produces far more varied material:
  • Imperfect shots: "out of focus," "rear view," "partial only"
  • Other life stages and traces: "larvae," "eggs," "molted shells," "droppings," "footprints"
Datasets that cover all of these states well enough for a model to judge them are lacking worldwide.

4. Spatial Bias and the "Luxury Effect"

Citizen science data carries a strong bias known as the "Luxury Effect": the wealthier an urban area is, the more data is collected there.
  • Reason: Wealthy areas have more green space (and therefore more organisms), and their residents have the "time and mental margin" to make observations with smartphones.
  • Problem: AI training data becomes biased toward "creatures in urban parks," while data from original habitats in "rural mountainous areas (depopulated areas)" goes unlearned, making the AI a "city kid."
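
As a rough illustration of how such spatial bias might be softened at training time, the sketch below buckets observations into a coarse latitude/longitude grid and caps how many are kept from any one cell. Everything here is an assumption made for the example: the function name, the 0.1 degree cell size, the cap, and the coordinates are hypothetical. Capping reduces over-representation of heavily sampled areas; it cannot supply the missing rural observations.

```python
import math
from collections import defaultdict

def cap_per_grid_cell(observations, cell_deg=0.1, max_per_cell=2):
    """Keep at most max_per_cell observations from each lat/lon grid cell."""
    kept = []
    per_cell = defaultdict(int)
    for obs in observations:
        # Grid cell index: integer row/column of size cell_deg degrees.
        cell = (math.floor(obs["lat"] / cell_deg),
                math.floor(obs["lon"] / cell_deg))
        if per_cell[cell] < max_per_cell:
            per_cell[cell] += 1
            kept.append(obs)
    return kept

# Hypothetical toy data: five observations from one urban park, one rural.
observations = [
    {"id": f"urban-{i}", "lat": 35.671 + i * 0.001, "lon": 139.695}
    for i in range(5)
] + [{"id": "rural-1", "lat": 36.204, "lon": 138.002}]

balanced = cap_per_grid_cell(observations)
print([obs["id"] for obs in balanced])
```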

IKIMON's Strategy: Japan as a "Data Sanctuary"

1. Leveraging Japan's "Photography Power"

Japan has many amateur photographers and naturalists whose skill is extremely high by global standards. Their photos are as detailed as academic specimen images and have artistic merit as well. By drawing on "Japan's photography skills," IKIMON aims to build a low-noise, "museum-grade dataset": an approach that emphasizes quality, not just quantity.

2. Validation First

AI suggestions are treated strictly as "assistance," and the workflow requires that "trustworthy human eyes (experts, experienced users)" always be involved before data is confirmed. Each record also carries a trust score based on "who identified it," so that filters such as "use only reliability-A images" can be applied when training AI, which prevents the model from learning from misidentified data.
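
A filter of this kind could be sketched as follows. This is a minimal illustration, not IKIMON's actual schema: only the "reliability-A" grade comes from the text above, while the field names, the B and C grades, and the sample records are assumptions.

```python
from dataclasses import dataclass

# Hypothetical ordering of reliability grades; only "A" appears in the article.
GRADE_ORDER = {"A": 3, "B": 2, "C": 1}

@dataclass(frozen=True)
class Observation:
    image_id: str
    species: str
    identified_by: str  # e.g. "expert", "experienced_user", "ai_only"
    reliability: str    # grade derived from who identified the record

def select_training_set(observations, min_grade="A"):
    """Keep only observations whose reliability is at least min_grade."""
    threshold = GRADE_ORDER[min_grade]
    return [
        obs for obs in observations
        # Unknown grades map to 0 and are therefore always excluded.
        if GRADE_ORDER.get(obs.reliability, 0) >= threshold
    ]

# Hypothetical sample records.
observations = [
    Observation("img-001", "Papilio maackii", "expert", "A"),
    Observation("img-002", "Papilio maackii", "experienced_user", "B"),
    Observation("img-003", "Papilio maackii", "ai_only", "C"),
]

training = select_training_set(observations, min_grade="A")
print([obs.image_id for obs in training])
```

Keeping the grade on the record, rather than deleting low-grade data, leaves room to relax the threshold later or to use lower grades for evaluation only.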

3. "Life Cycle" Dataset

We create AI that understands not just the "species name" but also the "state" of the organism.
  • Multi-stage Learning: Label and train each stage from egg → larva → pupa → adult.
  • Field Sign Learning: Droppings, feeding marks, and burrows, that is, traces "other than the organism itself," also become training targets.
This enables AI to perform reasoning that until now only veteran nature observers could do, such as "identifying the culprit insect from feeding marks on a leaf."
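
One way to picture such labels is a record that stores life stage and evidence type next to the species name. The sketch below is an assumed schema for illustration only; the class names, enum members, and the `class_key` format are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class LifeStage(Enum):
    EGG = "egg"
    LARVA = "larva"
    PUPA = "pupa"
    ADULT = "adult"
    UNKNOWN = "unknown"  # e.g. when only a trace is photographed

class EvidenceType(Enum):
    ORGANISM = "organism"          # the animal or plant itself
    DROPPINGS = "droppings"
    FEEDING_MARK = "feeding_mark"
    BURROW = "burrow"

@dataclass(frozen=True)
class Label:
    species: str
    life_stage: LifeStage
    evidence: EvidenceType

    def class_key(self) -> str:
        """Flatten the label into a single training-class identifier."""
        return f"{self.species}|{self.life_stage.value}|{self.evidence.value}"

# A larva photographed directly, and a feeding mark attributed to the same species.
larva = Label("Papilio maackii", LifeStage.LARVA, EvidenceType.ORGANISM)
mark = Label("Papilio maackii", LifeStage.UNKNOWN, EvidenceType.FEEDING_MARK)
print(larva.class_key())
print(mark.class_key())
```

Whether stage and evidence become separate output heads or part of one flattened class, as in `class_key`, is a modeling decision this sketch does not settle.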

Future Vision: Building the Large Nature Model (LNM)

What IKIMON aims for is a "multimodal foundation model of the natural world" combining images and language.

  • Input: A photo of a "mystery larva" taken with a smartphone, plus GPS information (location, time).
  • AI's Thinking:
      1. Image Analysis: "This has features of a swallowtail butterfly larva"
      2. Geographic Reasoning: "In this place (Hokkaido), at this time (October), only XX can be seen"
      3. Ecological Reasoning: "The leaf being eaten is Amur Cork Tree, so it's likely Papilio maackii"
  • Output: "This is a Papilio maackii larva. It will soon become a chrysalis. Is there an Amur Cork Tree nearby?"
Only when we can do this can children enjoy nature "like a game." Because Japan is both a biodiversity hotspot and a technology powerhouse, we have the duty and opportunity to create this "ultimate biodiversity AI."
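
The three reasoning steps in the example above can be imitated in a very simplified form by re-ranking image classifier scores with location, season, and host-plant knowledge. The sketch below is a toy stand-in under stated assumptions, not the LNM design: the function, the `host_bonus` factor, the placeholder species `species_x` and `species_y`, the placeholder plant `plant_z`, and all scores are invented for illustration. Only the pairing of Papilio maackii with the Amur Cork Tree comes from the example above.

```python
def rerank(image_scores, plausible_here_now, host_plants,
           observed_plant=None, host_bonus=2.0):
    """Combine image scores with geographic and ecological hints.

    image_scores:       species -> score from an image classifier
    plausible_here_now: species that can occur at this place and time
    host_plants:        species -> set of plants its larvae feed on
    """
    scores = {}
    for species, score in image_scores.items():
        # Geographic reasoning: drop species not expected here at this time.
        if species not in plausible_here_now:
            continue
        # Ecological reasoning: boost species whose host plant was observed.
        if observed_plant is not None and observed_plant in host_plants.get(species, ()):
            score *= host_bonus
        scores[species] = score
    total = sum(scores.values())
    if total == 0:
        return {}
    # Renormalize so the remaining scores sum to 1.
    return {species: value / total for species, value in scores.items()}

# All numbers and placeholder names below are made up for illustration.
image_scores = {"Papilio maackii": 0.40, "species_x": 0.45, "species_y": 0.15}
plausible_here_now = {"Papilio maackii", "species_x"}
host_plants = {"Papilio maackii": {"Amur Cork Tree"}, "species_x": {"plant_z"}}

result = rerank(image_scores, plausible_here_now, host_plants,
                observed_plant="Amur Cork Tree")
print(max(result, key=result.get))
```

In this toy input the image score alone slightly favors `species_x`; the host-plant hint is what tips the ranking, which mirrors the point of the ecological reasoning step.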

The World IKIMON Aims For

"I'm not an expert, so I don't understand nature."

We want to break that wall with technology.

Photograph a bug you find on a walk. The AI suggests "maybe this." An expert confirms "that's right." Data gathered this way becomes valuable scientific data for protecting Japan's nature.

A society where everyone can be a "discoverer." That is the future IKIMON wants to create.
