
Mailing List Archive: Linux: Kernel

Accounting problem of MIGRATE_ISOLATED freed page

 

 



minchan at kernel

Jun 19, 2012, 11:12 PM

Accounting problem of MIGRATE_ISOLATED freed page

Hi Aaditya,

I want to discuss this problem on another thread.

On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>
>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>
>>>>>
>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>> slightly misleading. priority 0 is not needed. balance_pgdat() calls
>>>>> pgdat_balanced()
>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>> normal zone has no reclaimable page.
>>>>>
>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>> sleep only if every zones have much free pages than high water mark
>>>>> _and_ 25% of present pages in node are free.
>>>>>
>>>>
>>>>
>>>> Sorry. I can't understand your point.
>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>> It seems I am missing your point.
>>>> Please anybody correct me.
>>>
>>> Since currently direct reclaim is given up based on
>>> zone->all_unreclaimable flag,
>>> so for e.g in one of the scenarios:
>>>
>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>> hot-remove the all the pages of the MOVABLE zone.
>>>
>>> While migrating pages during memory hot-unplugging, the allocation function
>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>> looping in direct reclaim path for ever.
>>>
>>> This is so because when most of the pages in the MOVABLE zone have
>>> been migrated,
>>> the zone now contains lots of free memory (basically above low watermark)
>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>
>>> So kswapd() would not balance this zone as free pages are above low watermark
>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>> never be set for this zone
>>> and allocation function would end up looping forever. (assuming the
>>> zone NORMAL is
>>> left with no reclaimable memory)
>>>
>>
>>
>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>> But I don't see it's a problem of kswapd.
>
> Hi Kim,

I prefer to be called Minchan rather than Kim.
Never mind. :)

>
> Yes I agree it is not a problem of kswapd.

Yep.

>
>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>> but we can't allocate it. :(
>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>> Kswapd is just one of them confused.
>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>
>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>
>
> I assume that by the inconsistency you mention above, you mean
> temporary inconsistency.
>
> Sorry, but IMHO as for memory hot plug the main issue with this patch
> is that the inconsistency you mentioned above would NOT be a temporary
> inconsistency.
>
> Every time say 'x' number of page frames are off lined, they will
> introduce a difference of 'x' pages between
> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
> (So for e.g. if we do a frequent offline/online it will make
> NR_FREE_PAGES negative)
>
> This is so because, unset_migratetype_isolate() is called from
> offlining code (to set the migrate type of off lined pages again back
> to MIGRATE_MOVABLE)
> after the pages have been off lined and removed from the buddy list.
> Since the pages for which unset_migratetype_isolate() is called are
> not buddy pages so move_freepages_block() does not move any page, and
> thus introducing a permanent inconsistency.

Good point. The negative NR_FREE_PAGES is caused by double accounting between my patch and __offline_isolated_pages().
I think that, in the first place, pages freed into MIGRATE_ISOLATE shouldn't be accounted as free pages.
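
To make the drift concrete, here is a minimal userspace sketch in C (not kernel code). It assumes the earlier patch subtracted the moved pages from NR_FREE_PAGES at isolation time and added back whatever unset_migratetype_isolate() managed to move; every toy_* name is hypothetical, and all pages are treated as order-0 so that nr_free and page counts coincide.

```c
#include <assert.h>

/* Toy model of the two counters discussed above (all names hypothetical). */
struct toy_zone {
	long nr_free_pages;	/* models NR_FREE_PAGES */
	long nr_free_sum;	/* models SumOf[free_area[order].nr_free] */
};

/* Assumed behaviour of the earlier patch: isolation hides x free pages. */
static void toy_isolate(struct toy_zone *z, long x)
{
	z->nr_free_pages -= x;
}

/* Offlining drops the pages from the buddy lists and decrements both
 * counters, so NR_FREE_PAGES is reduced a second time. */
static void toy_offline(struct toy_zone *z, long x)
{
	z->nr_free_pages -= x;
	z->nr_free_sum -= x;
}

/* After offlining, no page in the block is a buddy page any more, so the
 * un-isolate step finds nothing to move and adds nothing back. */
static void toy_unisolate(struct toy_zone *z, long buddy_pages_found)
{
	z->nr_free_pages += buddy_pages_found;
}

/* Onlining frees the pages again through the normal free path. */
static void toy_online(struct toy_zone *z, long x)
{
	z->nr_free_pages += x;
	z->nr_free_sum += x;
}

/* Run `cycles` offline/online rounds of x pages each. */
static void toy_run(struct toy_zone *z, long x, int cycles)
{
	int i;

	for (i = 0; i < cycles; i++) {
		toy_isolate(z, x);
		toy_offline(z, x);
		toy_unisolate(z, 0);
		toy_online(z, x);
	}
}

/* Final value of the modelled NR_FREE_PAGES. */
long toy_nr_free_after_cycles(long start, long x, int cycles)
{
	struct toy_zone z = { start, start };

	toy_run(&z, x, cycles);
	return z.nr_free_pages;
}

/* Final difference NR_FREE_PAGES - SumOf[nr_free]. */
long toy_drift_after_cycles(long start, long x, int cycles)
{
	struct toy_zone z = { start, start };

	toy_run(&z, x, cycles);
	return z.nr_free_pages - z.nr_free_sum;
}
```

In this model every round loses x pages from the global counter, so starting from 1000 free pages, two rounds of 600 pages already push the modelled NR_FREE_PAGES below zero while the nr_free sum is back at its starting value.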

>
>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>> free_area[order].nr_free exactly.
>>
>> Any thought?
>
> As for fixing move_freepages_block(), At least for memory hot plug,
> the pages stay in MIGRATE_ISOLATE list only for duration
> offline_pages() function,
> I mean only temporarily. Since fixing move_freepages_block() for will
> introduce some overhead, So I am not very sure whether that overhead
> is justified
> for a temporary condition. What do you think?

Yes. I don't like hurting the fast path, either.
How about this? (It has only passed a compile test :( )
The patch's goal is to NOT increase nr_free and NR_FREE_PAGES when a page is freed into MIGRATE_ISOLATE.

This patch hurts the high-order page free path, but I think that's not critical because high-order allocation
is rarer than order-0 allocation, and we already do the same check in free_hot_cold_page() on the order-0 free path,
which is hotter.

The inlined patch below may be completely malformed; I can't inline code properly at the office. Sorry.
I will attach the patch as well.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..d2a515d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -676,6 +676,24 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }

+/*
+ * This function is almost same with free_one_page except that it
+ * doesn't increase NR_FREE_PAGES and free_area[order].nr_free.
+ * Because page allocator can't allocate MIGRATE_ISOLATE type page.
+ *
+ * Caller should hold zone->lock.
+ */
+static void free_one_isolated_page(struct zone *zone, struct page *page,
+				int order)
+{
+	zone->all_unreclaimable = 0;
+	zone->pages_scanned = 0;
+
+	__free_one_page(page, zone, order, MIGRATE_ISOLATE);
+	/* rollback nr_free increased by __free_one_page */
+	zone->free_area[order].nr_free--;
+}
+
 static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
@@ -683,6 +701,13 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;

+	/*
+	 * Freed MIGRATE_ISOLATE page should be free_one_isolated_page path
+	 * because page allocator don't want to increase NR_FREE_PAGES and
+	 * free_area[order].nr_free.
+	 */
+	VM_BUG_ON(migratetype == MIGRATE_ISOLATE);
+
 	__free_one_page(page, zone, order, migratetype);
 	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
@@ -718,6 +743,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int wasMlocked = __TestClearPageMlocked(page);
+	int migratetype;

 	if (!free_pages_prepare(page, order))
 		return;
@@ -726,8 +752,21 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (unlikely(wasMlocked))
 		free_page_mlock(page);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, order,
-					get_pageblock_migratetype(page));
+	migratetype = get_pageblock_migratetype(page);
+	/*
+	 * High order page alloc/free is rare compared to
+	 * order-0. So this condition check should be not
+	 * critical about performance.
+	 */
+	if (unlikely(migratetype == MIGRATE_ISOLATE)) {
+		struct zone *zone = page_zone(page);
+		spin_lock(&zone->lock);
+		free_one_isolated_page(zone, page, order);
+		spin_unlock(&zone->lock);
+	}
+	else {
+		free_one_page(page_zone(page), page, order, migratetype);
+	}
 	local_irq_restore(flags);
 }

@@ -906,6 +945,55 @@ static int fallbacks[MIGRATE_TYPES][4] = {
 	[MIGRATE_ISOLATE] = { MIGRATE_RESERVE }, /* Never used */
 };

+static int hotplug_move_freepages(struct zone *zone,
+		struct page *start_page, struct page *end_page,
+		int from_migratetype, int to_migratetype)
+{
+	struct page *page;
+	unsigned long order;
+	int pages_moved = 0;
+
+#ifndef CONFIG_HOLES_IN_ZONE
+	/*
+	 * page_zone is not safe to call in this context when
+	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
+	 * anyway as we check zone boundaries in move_freepages_block().
+	 * Remove at a later date when no bug reports exist related to
+	 * grouping pages by mobility
+	 */
+	BUG_ON(page_zone(start_page) != page_zone(end_page));
+#endif
+
+	BUG_ON(from_migratetype == to_migratetype);
+
+	for (page = start_page; page <= end_page;) {
+		/* Make sure we are not inadvertently changing nodes */
+		VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
+
+		if (!pfn_valid_within(page_to_pfn(page))) {
+			page++;
+			continue;
+		}
+
+		if (!PageBuddy(page)) {
+			page++;
+			continue;
+		}
+
+		order = page_order(page);
+		list_move(&page->lru,
+			&zone->free_area[order].free_list[to_migratetype]);
+		if (to_migratetype == MIGRATE_ISOLATE)
+			zone->free_area[order].nr_free--;
+		else if (from_migratetype == MIGRATE_ISOLATE)
+			zone->free_area[order].nr_free++;
+		page += 1 << order;
+		pages_moved += 1 << order;
+	}
+
+	return pages_moved;
+}
+
 /*
  * Move the free pages in a range to the free lists of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -954,6 +1042,32 @@ static int move_freepages(struct zone *zone,
 	return pages_moved;
 }

+/*
+ * It's almost same with move_freepages_block except [from, to] migratetype.
+ * We need it for accounting zone->free_area[order].nr_free exactly.
+ */
+static int hotplug_move_freepages_block(struct zone *zone, struct page *page,
+		int from_migratetype, int to_migratetype)
+{
+	unsigned long start_pfn, end_pfn;
+	struct page *start_page, *end_page;
+
+	start_pfn = page_to_pfn(page);
+	start_pfn = start_pfn & ~(pageblock_nr_pages-1);
+	start_page = pfn_to_page(start_pfn);
+	end_page = start_page + pageblock_nr_pages - 1;
+	end_pfn = start_pfn + pageblock_nr_pages - 1;
+
+	/* Do not cross zone boundaries */
+	if (start_pfn < zone->zone_start_pfn)
+		start_page = page;
+	if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
+		return 0;
+
+	return hotplug_move_freepages(zone, start_page, end_page,
+			from_migratetype, to_migratetype);
+}
+
 static int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype)
 {
@@ -1311,7 +1425,9 @@ void free_hot_cold_page(struct page *page, int cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(migratetype == MIGRATE_ISOLATE)) {
-			free_one_page(zone, page, 0, migratetype);
+			spin_lock(&zone->lock);
+			free_one_isolated_page(zone, page, 0);
+			spin_unlock(&zone->lock);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -1388,6 +1504,7 @@ int split_free_page(struct page *page)
 	unsigned int order;
 	unsigned long watermark;
 	struct zone *zone;
+	int migratetype;

 	BUG_ON(!PageBuddy(page));

@@ -1400,10 +1517,17 @@ int split_free_page(struct page *page)
 		return 0;

 	/* Remove page from free list */
+	migratetype = get_pageblock_migratetype(page);
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	/*
+	 * Page allocator didn't increase nr_free and NR_FREE_PAGES on pages
+	 * which are in free_area[order].free_list[MIGRATE_ISOLATE] pages.
+	 */
+	if (migratetype != MIGRATE_ISOLATE) {
+		zone->free_area[order].nr_free--;
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+	}
 	rmv_page_order(page);
-	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

 	/* Split into individual pages */
 	set_page_refcounted(page);
@@ -5593,8 +5717,11 @@ int set_migratetype_isolate(struct page *page)

 out:
 	if (!ret) {
+		int pages_moved;
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
-		move_freepages_block(zone, page, MIGRATE_ISOLATE);
+		pages_moved = hotplug_move_freepages_block(zone, page,
+				MIGRATE_MOVABLE, MIGRATE_ISOLATE);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
 	}

 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -5607,12 +5734,15 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 {
 	struct zone *zone;
 	unsigned long flags;
+	int pages_moved;
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lock, flags);
 	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
 		goto out;
 	set_pageblock_migratetype(page, migratetype);
-	move_freepages_block(zone, page, migratetype);
+	pages_moved = hotplug_move_freepages_block(zone, page,
+			MIGRATE_ISOLATE, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -5900,9 +6030,6 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 		list_del(&page->lru);
 		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
-		__mod_zone_page_state(zone, NR_FREE_PAGES,
-				      - (1UL << order));
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);
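
The accounting rule this patch aims for can be sketched in userspace C as follows. This is a toy model under simplifying assumptions (a single pageblock, no buddy merging, hypothetical toy_* names), not the kernel implementation: blocks freed while isolated never touch the counters, and isolating or un-isolating moves already-free blocks out of or back into the counted set.

```c
#include <assert.h>

#define TOY_MAX_ORDER 4

struct toy_zone {
	long nr_free_pages;		/* models NR_FREE_PAGES */
	long nr_free[TOY_MAX_ORDER];	/* models free_area[order].nr_free */
	long isolated[TOY_MAX_ORDER];	/* free blocks on the isolate list */
};

/* Free one block of 2^order pages; isolated frees skip both counters. */
void toy_free(struct toy_zone *z, int order, int block_is_isolated)
{
	if (block_is_isolated) {
		z->isolated[order]++;
		return;
	}
	z->nr_free[order]++;
	z->nr_free_pages += 1L << order;
}

/* Isolate: move every counted free block to the isolate list and
 * subtract the moved pages from both counters. */
long toy_set_isolate(struct toy_zone *z)
{
	long pages_moved = 0;
	int order;

	for (order = 0; order < TOY_MAX_ORDER; order++) {
		pages_moved += z->nr_free[order] << order;
		z->isolated[order] += z->nr_free[order];
		z->nr_free[order] = 0;
	}
	z->nr_free_pages -= pages_moved;
	return pages_moved;
}

/* Un-isolate: move the blocks back and add them to both counters. */
long toy_unset_isolate(struct toy_zone *z)
{
	long pages_moved = 0;
	int order;

	for (order = 0; order < TOY_MAX_ORDER; order++) {
		pages_moved += z->isolated[order] << order;
		z->nr_free[order] += z->isolated[order];
		z->isolated[order] = 0;
	}
	z->nr_free_pages += pages_moved;
	return pages_moved;
}

/* Scenario: 5 free pages, isolate, free 2 more while isolated, un-isolate.
 * Returns the final modelled NR_FREE_PAGES. */
long toy_demo(void)
{
	struct toy_zone z = { 0 };
	long moved;

	toy_free(&z, 0, 0);		/* 1 page, counted */
	toy_free(&z, 2, 0);		/* 4 pages, counted */
	moved = toy_set_isolate(&z);	/* hides all 5 pages */
	assert(moved == 5 && z.nr_free_pages == 0);
	toy_free(&z, 1, 1);		/* 2 pages freed while isolated */
	assert(z.nr_free_pages == 0);	/* still not counted as free */
	(void)moved;
	toy_unset_isolate(&z);		/* 5 + 2 pages become allocatable */
	return z.nr_free_pages;
}
```

In this model the global counter only ever reflects blocks the allocator could actually hand out, which is the property the kswapd and watermark checks rely on.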






--
Kind regards,
Minchan Kim
Attachments: patch.patch (7.53 KB)


kosaki.motohiro at gmail

Jun 19, 2012, 11:32 PM

Re: Accounting problem of MIGRATE_ISOLATED freed page

(6/20/12 2:12 AM), Minchan Kim wrote:
>
> Hi Aaditya,
>
> I want to discuss this problem on another thread.
>
> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>
>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>
>>>>>>
>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>> slightly misleading. priority 0 is not needed. balance_pgdat() calls
>>>>>> pgdat_balanced()
>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>> normal zone has no reclaimable page.
>>>>>>
>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>> _and_ 25% of present pages in node are free.
>>>>>>
>>>>>
>>>>>
>>>>> Sorry. I can't understand your point.
>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>> It seems I am missing your point.
>>>>> Please anybody correct me.
>>>>
>>>> Since currently direct reclaim is given up based on
>>>> zone->all_unreclaimable flag,
>>>> so for e.g in one of the scenarios:
>>>>
>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>
>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>>> looping in direct reclaim path for ever.
>>>>
>>>> This is so because when most of the pages in the MOVABLE zone have
>>>> been migrated,
>>>> the zone now contains lots of free memory (basically above low watermark)
>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>
>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>> never be set for this zone
>>>> and allocation function would end up looping forever. (assuming the
>>>> zone NORMAL is
>>>> left with no reclaimable memory)
>>>>
>>>
>>>
>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>> But I don't see it's a problem of kswapd.
>>
>> Hi Kim,
>
> I like called Minchan rather than Kim
> Never mind. :)
>
>>
>> Yes I agree it is not a problem of kswapd.
>
> Yeb.
>
>>
>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>> but we can't allocate it. :(
>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>> Kswapd is just one of them confused.
>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>
>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>
>>
>> I assume that by the inconsistency you mention above, you mean
>> temporary inconsistency.
>>
>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>> is that the inconsistency you mentioned above would NOT be a temporary
>> inconsistency.
>>
>> Every time say 'x' number of page frames are off lined, they will
>> introduce a difference of 'x' pages between
>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>> (So for e.g. if we do a frequent offline/online it will make
>> NR_FREE_PAGES negative)
>>
>> This is so because, unset_migratetype_isolate() is called from
>> offlining code (to set the migrate type of off lined pages again back
>> to MIGRATE_MOVABLE)
>> after the pages have been off lined and removed from the buddy list.
>> Since the pages for which unset_migratetype_isolate() is called are
>> not buddy pages so move_freepages_block() does not move any page, and
>> thus introducing a permanent inconsistency.
>
> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>
>>
>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>> free_area[order].nr_free exactly.
>>>
>>> Any thought?
>>
>> As for fixing move_freepages_block(), At least for memory hot plug,
>> the pages stay in MIGRATE_ISOLATE list only for duration
>> offline_pages() function,
>> I mean only temporarily. Since fixing move_freepages_block() for will
>> introduce some overhead, So I am not very sure whether that overhead
>> is justified
>> for a temporary condition. What do you think?
>
> Yes. I don't like hurt fast path, either.
> How about this? (Passed just compile test :( )
> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>
> This patch hurts high order page free path but I think it's not critical because higher order allocation
> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
> which is more hot.

Can't we change zone_watermark_ok_safe() instead of the page allocator? Memory hotplug is a really rare event.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


minchan at kernel

Jun 20, 2012, 12:53 AM

Re: Accounting problem of MIGRATE_ISOLATED freed page

On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:

> (6/20/12 2:12 AM), Minchan Kim wrote:
>>
>> Hi Aaditya,
>>
>> I want to discuss this problem on another thread.
>>
>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>
>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>
>>>>>>>
>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>>> slightly misleading. priority 0 is not needed. balance_pgdat() calls
>>>>>>> pgdat_balanced()
>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>>> normal zone has no reclaimable page.
>>>>>>>
>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>>> _and_ 25% of present pages in node are free.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Sorry. I can't understand your point.
>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>>> It seems I am missing your point.
>>>>>> Please anybody correct me.
>>>>>
>>>>> Since currently direct reclaim is given up based on
>>>>> zone->all_unreclaimable flag,
>>>>> so for e.g in one of the scenarios:
>>>>>
>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>>
>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>>>> looping in direct reclaim path for ever.
>>>>>
>>>>> This is so because when most of the pages in the MOVABLE zone have
>>>>> been migrated,
>>>>> the zone now contains lots of free memory (basically above low watermark)
>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>>
>>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>>> never be set for this zone
>>>>> and allocation function would end up looping forever. (assuming the
>>>>> zone NORMAL is
>>>>> left with no reclaimable memory)
>>>>>
>>>>
>>>>
>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>>> But I don't see it's a problem of kswapd.
>>>
>>> Hi Kim,
>>
>> I like called Minchan rather than Kim
>> Never mind. :)
>>
>>>
>>> Yes I agree it is not a problem of kswapd.
>>
>> Yeb.
>>
>>>
>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>>> but we can't allocate it. :(
>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>>> Kswapd is just one of them confused.
>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>>
>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>>
>>>
>>> I assume that by the inconsistency you mention above, you mean
>>> temporary inconsistency.
>>>
>>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>>> is that the inconsistency you mentioned above would NOT be a temporary
>>> inconsistency.
>>>
>>> Every time say 'x' number of page frames are off lined, they will
>>> introduce a difference of 'x' pages between
>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>>> (So for e.g. if we do a frequent offline/online it will make
>>> NR_FREE_PAGES negative)
>>>
>>> This is so because, unset_migratetype_isolate() is called from
>>> offlining code (to set the migrate type of off lined pages again back
>>> to MIGRATE_MOVABLE)
>>> after the pages have been off lined and removed from the buddy list.
>>> Since the pages for which unset_migratetype_isolate() is called are
>>> not buddy pages so move_freepages_block() does not move any page, and
>>> thus introducing a permanent inconsistency.
>>
>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>>
>>>
>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>>> free_area[order].nr_free exactly.
>>>>
>>>> Any thought?
>>>
>>> As for fixing move_freepages_block(), At least for memory hot plug,
>>> the pages stay in MIGRATE_ISOLATE list only for duration
>>> offline_pages() function,
>>> I mean only temporarily. Since fixing move_freepages_block() for will
>>> introduce some overhead, So I am not very sure whether that overhead
>>> is justified
>>> for a temporary condition. What do you think?
>>
>> Yes. I don't like hurt fast path, either.
>> How about this? (Passed just compile test :( )
>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>>
>> This patch hurts high order page free path but I think it's not critical because higher order allocation
>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
>> which is more hot.
>
> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event.


+1

At first, I wanted to make zone_page_state(z, NR_FREE_PAGES) itself more accurate because it is used in
several places. But looking over those places, I can't find any critical one except the case where kswapd sleeps forever.
So it's a nice idea!

In that case, we need to take zone->lock whenever zone_watermark_ok_safe() is called.
In most cases that's unnecessary, and it might hurt alloc/free performance when memory pressure is high.
But if memory pressure is high, alloc/free performance is probably already meaningless.
So it does make sense, IMHO.

Please raise your hand if anyone has a concern about this.

barrios [at] bbo:~/linux-next$ git diff
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2a515d..82cc0a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 					zone_page_state(z, NR_FREE_PAGES));
 }

-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
+	struct free_area *area;
+	struct list_head *curr;
+	int order;
+	unsigned long flags;
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);

 	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
 		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

-	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
-								free_pages);
+	/*
+	 * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE
+	 * so that buddy can't allocate it although they are in free list.
+	 */
+	spin_lock_irqsave(&z->lock, flags);
+	for (order = 0; order < MAX_ORDER; order++) {
+		int count = 0;
+		area = &(z->free_area[order]);
+		if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
+			list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
+				count++;
+			free_pages -= (count << order);
+		}
+	}
+	if (free_pages < 0)
+		free_pages = 0;
+	spin_unlock_irqrestore(&z->lock, flags);
+
+	return __zone_watermark_ok(z, alloc_order, mark,
+			classzone_idx, alloc_flags, free_pages);
 }

 #ifdef CONFIG_NUMA
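
Stripped of locking and kernel data structures, the idea in this diff can be sketched as a small userspace C function. The toy_* names and TOY_MAX_ORDER are hypothetical, and the final comparison is only a simplified stand-in for __zone_watermark_ok(), which in the kernel also considers lowmem reserves and per-order free counts.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11

/* Subtract the pages sitting on the (modelled) MIGRATE_ISOLATE lists from
 * the raw free count, clamping at zero. isolated_blocks[order] is the
 * number of free blocks of 2^order pages on the isolate list. */
long toy_usable_free_pages(long nr_free_pages, const long *isolated_blocks)
{
	long free_pages = nr_free_pages;
	int order;

	for (order = 0; order < TOY_MAX_ORDER; order++)
		free_pages -= isolated_blocks[order] << order;
	if (free_pages < 0)
		free_pages = 0;
	return free_pages;
}

/* Simplified watermark test: only the total is compared with the mark. */
int toy_watermark_ok(long nr_free_pages, const long *isolated_blocks,
		     long mark)
{
	return toy_usable_free_pages(nr_free_pages, isolated_blocks) > mark;
}

/* 1024 raw free pages, but all of them in one isolated order-10 block. */
int toy_demo_fully_isolated_ok(void)
{
	long isolated[TOY_MAX_ORDER] = { 0 };

	isolated[10] = 1;
	return toy_watermark_ok(1024, isolated, 128);
}

/* 20 raw free pages; four order-0 and two order-2 blocks are isolated. */
long toy_demo_partly_isolated(void)
{
	long isolated[TOY_MAX_ORDER] = { 4, 0, 2 };

	return toy_usable_free_pages(20, isolated);
}
```

With the isolated blocks subtracted, a zone whose free memory is entirely isolated no longer passes the watermark test, so a caller such as kswapd would keep treating it as unbalanced.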









--
Kind regards,
Minchan Kim


dhillf at gmail

Jun 20, 2012, 5:44 AM

Re: Accounting problem of MIGRATE_ISOLATED freed page

On Wed, Jun 20, 2012 at 3:53 PM, Minchan Kim <minchan [at] kernel> wrote:
> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:
>
>> (6/20/12 2:12 AM), Minchan Kim wrote:
>>>
>>> Hi Aaditya,
>>>
>>> I want to discuss this problem on another thread.
>>>
>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>>
>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>>
>>>>>>>>
>>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>>>> slightly misleading. priority 0 is not needed. balance_pgdat() calls
>>>>>>>> pgdat_balanced()
>>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>>>> normal zone has no reclaimable page.
>>>>>>>>
>>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>>>> _and_ 25% of present pages in node are free.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sorry. I can't understand your point.
>>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>>>> It seems I am missing your point.
>>>>>>> Please anybody correct me.
>>>>>>
>>>>>> Since currently direct reclaim is given up based on
>>>>>> zone->all_unreclaimable flag,
>>>>>> so for e.g in one of the scenarios:
>>>>>>
>>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>>>
>>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>>> (for new page to which the page in MOVABLE zone would be moved)  can end up
>>>>>> looping in direct reclaim path for ever.
>>>>>>
>>>>>> This is so because when most of the pages in the MOVABLE zone have
>>>>>> been migrated,
>>>>>> the zone now contains lots of free memory (basically above low watermark)
>>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>>>
>>>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>>>> never be set for this zone
>>>>>> and allocation function would end up looping forever. (assuming the
>>>>>> zone NORMAL is
>>>>>> left with no reclaimable memory)
>>>>>>
>>>>>
>>>>>
>>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>>>> But I don't see it's a problem of kswapd.
>>>>
>>>> Hi Kim,
>>>
>>> I like called Minchan rather than Kim
>>> Never mind. :)
>>>
>>>>
>>>> Yes I agree it is not a problem of kswapd.
>>>
>>> Yeb.
>>>
>>>>
>>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>>>> but we can't allocate it. :(
>>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>>>> Kswapd is just one of them confused.
>>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>>>
>>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>>>
>>>>
>>>> I assume that by the inconsistency you mention above, you mean
>>>> temporary inconsistency.
>>>>
>>>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>>>> is that the inconsistency you mentioned above would NOT be a temporary
>>>> inconsistency.
>>>>
>>>> Every time say 'x' number of page frames are off lined, they will
>>>> introduce a difference of 'x' pages between
>>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>>>> (So for e.g. if we do a frequent offline/online it will make
>>>> NR_FREE_PAGES  negative)
>>>>
>>>> This is so because, unset_migratetype_isolate() is called from
>>>> offlining  code (to set the migrate type of off lined pages again back
>>>> to MIGRATE_MOVABLE)
>>>> after the pages have been off lined and removed from the buddy list.
>>>> Since the pages for which unset_migratetype_isolate() is called are
>>>> not buddy pages so move_freepages_block() does not move any page, and
>>>> thus introducing a permanent inconsistency.
>>>
>>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
>>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>>>
>>>>
>>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>>>> free_area[order].nr_free exactly.
>>>>>
>>>>> Any thought?
>>>>
>>>> As for fixing move_freepages_block(), At least for memory hot plug,
>>>> the pages stay in MIGRATE_ISOLATE list only for duration
>>>> offline_pages() function,
>>>> I mean only temporarily. Since fixing move_freepages_block() for will
>>>> introduce some overhead, So I am not very sure whether that overhead
>>>> is justified
>>>> for a temporary condition. What do you think?
>>>
>>> Yes. I don't like hurt fast path, either.
>>> How about this? (Passed just compile test :(  )
>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>>>
>>> This patch hurts high order page free path but I think it's not critical because higher order allocation
>>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
>>> which is more hot.
>>
>> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event.
>
>
> +1
>
> Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by
> several places. As I looked over places, I can't find critical places except kswapd forever sleep case.
> So it's a nice idea!
>
> In that case, we need zone->lock whenever zone_watermark_ok_safe is called.
>
Ifdefinery could be utilized for builds with CMA disabled, first.

> Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high.
> But if memory pressure is high, it may be already meaningless alloc/free performance.
> So it does make sense, IMHO.
>
> Please raise your hands if anyone has a concern about this.
>
> barrios [at] bbo:~/linux-next$ git diff
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2a515d..82cc0a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>                                        zone_page_state(z, NR_FREE_PAGES));
>  }
>
> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
>                      int classzone_idx, int alloc_flags)
>  {
> +       struct free_area *area;
> +       struct list_head *curr;
> +       int order;
> +       unsigned long flags;
>        long free_pages = zone_page_state(z, NR_FREE_PAGES);
>
>        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>                free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> -       return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> -                                                               free_pages);
> +       /*
> +        * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE
> +        * so that buddy can't allocate it although they are in free list.
> +        */
> +       spin_lock_irqsave(&z->lock, flags);
> +       for (order = 0; order < MAX_ORDER; order++) {
> +               int count = 0;
> +               area = &(z->free_area[order]);
> +               if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
> +                       list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> +                               count++;
> +                       free_pages -= (count << order);
> +               }
> +       }
> +       if (free_pages < 0)
> +               free_pages = 0;
> +       spin_unlock_irqrestore(&z->lock, flags);
> +
> +       return __zone_watermark_ok(z, alloc_order, mark,
> +                               classzone_idx, alloc_flags, free_pages);
>  }
>
Then isolated pages could be scanned in another direction?

spin_lock_irqsave(&z->lock, flags);
for (order = MAX_ORDER - 1; order >= 0; order--) {
        struct free_area *area = &z->free_area[order];
        long count = 0;
        struct list_head *curr;

        list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
                count++;

        free_pages -= (count << order);
        if (free_pages < 0) {
                free_pages = 0;
                break;
        }
}
spin_unlock_irqrestore(&z->lock, flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

Jun 20, 2012, 1:19 PM

Post #5 of 30 (609 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

(6/20/12 3:53 AM), Minchan Kim wrote:
> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:
>
>> (6/20/12 2:12 AM), Minchan Kim wrote:
>>>
>>> Hi Aaditya,
>>>
>>> I want to discuss this problem on another thread.
>>>
>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>>
>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>>
>>>>>>>>
>>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>>>>>> pgdat_balanced()
>>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>>>> normal zone has no reclaimable page.
>>>>>>>>
>>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>>>> _and_ 25% of present pages in node are free.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sorry. I can't understand your point.
>>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>>>> It seems I am missing your point.
>>>>>>> Please anybody correct me.
>>>>>>
>>>>>> Since currently direct reclaim is given up based on
>>>>>> zone->all_unreclaimable flag,
>>>>>> so for e.g in one of the scenarios:
>>>>>>
>>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>>>
>>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>>>>> looping in direct reclaim path for ever.
>>>>>>
>>>>>> This is so because when most of the pages in the MOVABLE zone have
>>>>>> been migrated,
>>>>>> the zone now contains lots of free memory (basically above low watermark)
>>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>>>
>>>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>>>> never be set for this zone
>>>>>> and allocation function would end up looping forever. (assuming the
>>>>>> zone NORMAL is
>>>>>> left with no reclaimable memory)
>>>>>>
>>>>>
>>>>>
>>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>>>> But I don't see it's a problem of kswapd.
>>>>
>>>> Hi Kim,
>>>
>>> I like called Minchan rather than Kim
>>> Never mind. :)
>>>
>>>>
>>>> Yes I agree it is not a problem of kswapd.
>>>
>>> Yeb.
>>>
>>>>
>>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>>>> but we can't allocate it. :(
>>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>>>> Kswapd is just one of them confused.
>>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>>>
>>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>>>
>>>>
>>>> I assume that by the inconsistency you mention above, you mean
>>>> temporary inconsistency.
>>>>
>>>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>>>> is that the inconsistency you mentioned above would NOT be a temporary
>>>> inconsistency.
>>>>
>>>> Every time say 'x' number of page frames are off lined, they will
>>>> introduce a difference of 'x' pages between
>>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>>>> (So for e.g. if we do a frequent offline/online it will make
>>>> NR_FREE_PAGES negative)
>>>>
>>>> This is so because, unset_migratetype_isolate() is called from
>>>> offlining code (to set the migrate type of off lined pages again back
>>>> to MIGRATE_MOVABLE)
>>>> after the pages have been off lined and removed from the buddy list.
>>>> Since the pages for which unset_migratetype_isolate() is called are
>>>> not buddy pages so move_freepages_block() does not move any page, and
>>>> thus introducing a permanent inconsistency.
>>>
>>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
>>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>>>
>>>>
>>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>>>> free_area[order].nr_free exactly.
>>>>>
>>>>> Any thought?
>>>>
>>>> As for fixing move_freepages_block(), At least for memory hot plug,
>>>> the pages stay in MIGRATE_ISOLATE list only for duration
>>>> offline_pages() function,
>>>> I mean only temporarily. Since fixing move_freepages_block() for will
>>>> introduce some overhead, So I am not very sure whether that overhead
>>>> is justified
>>>> for a temporary condition. What do you think?
>>>
>>> Yes. I don't like hurt fast path, either.
>>> How about this? (Passed just compile test :( )
>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>>>
>>> This patch hurts high order page free path but I think it's not critical because higher order allocation
>>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
>>> which is more hot.
>>
>> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event.
>
>
> +1
>
> Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by
> several places. As I looked over places, I can't find critical places except kswapd forever sleep case.
> So it's a nice idea!
>
> In that case, we need zone->lock whenever zone_watermark_ok_safe is called.
> Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high.
> But if memory pressure is high, it may be already meaningless alloc/free performance.
> So it does make sense, IMHO.
>
> Please raise your hands if anyone has a concern about this.
>
> barrios [at] bbo:~/linux-next$ git diff
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2a515d..82cc0a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>                                        zone_page_state(z, NR_FREE_PAGES));
>  }
>
> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
>                      int classzone_idx, int alloc_flags)
>  {
> +       struct free_area *area;
> +       struct list_head *curr;
> +       int order;
> +       unsigned long flags;
>        long free_pages = zone_page_state(z, NR_FREE_PAGES);
>
>        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>                free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> -       return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> -                                                               free_pages);
> +       /*
> +        * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE
> +        * so that buddy can't allocate it although they are in free list.
> +        */
> +       spin_lock_irqsave(&z->lock, flags);
> +       for (order = 0; order < MAX_ORDER; order++) {
> +               int count = 0;
> +               area = &(z->free_area[order]);
> +               if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
> +                       list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> +                               count++;
> +                       free_pages -= (count << order);
> +               }
> +       }
> +       if (free_pages < 0)
> +               free_pages = 0;
> +       spin_unlock_irqrestore(&z->lock, flags);
> +
> +       return __zone_watermark_ok(z, alloc_order, mark,
> +                               classzone_idx, alloc_flags, free_pages);
>  }

The number of isolated pageblocks is almost always 0. So if we had such a counter,
we could almost always avoid zone->lock. Just an idea.
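
To make the idea concrete, here is a rough userspace sketch, not kernel code. The struct, its field names and the fake lock are all made up for illustration; a real version would read the counter with an atomic or READ_ONCE-style access and take the real zone->lock.

```c
#include <assert.h>

#define MAX_ORDER 11

/* Hypothetical stand-in for struct zone; none of these fields exist in the kernel. */
struct zone_counter_model {
	int lock_held;                 /* fake zone->lock state */
	int lock_acquisitions;         /* how often the slow path took the lock */
	long nr_free_pages;            /* plays the role of NR_FREE_PAGES */
	long nr_isolated_blocks;       /* the proposed counter of isolated pageblocks */
	long isolated[MAX_ORDER];      /* free entries per order on the isolate list */
};

static void model_lock(struct zone_counter_model *z)
{
	assert(!z->lock_held);
	z->lock_held = 1;
	z->lock_acquisitions++;
}

static void model_unlock(struct zone_counter_model *z)
{
	assert(z->lock_held);
	z->lock_held = 0;
}

/* Free pages that the allocator could actually hand out. */
static long usable_free_pages(struct zone_counter_model *z)
{
	long free_pages = z->nr_free_pages;
	int order;

	/* Common case: nothing is isolated, so the lock is never touched. */
	if (z->nr_isolated_blocks == 0)
		return free_pages;

	/* Rare case: walk the isolate lists under the lock. */
	model_lock(z);
	for (order = 0; order < MAX_ORDER; order++)
		free_pages -= z->isolated[order] << order;
	model_unlock(z);

	return free_pages < 0 ? 0 : free_pages;
}
```

The sketch only shows the fast-path check; it says nothing about how the counter would be kept up to date.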





minchan at kernel

Jun 20, 2012, 4:58 PM

Post #6 of 30 (608 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On 06/20/2012 09:44 PM, Hillf Danton wrote:

> On Wed, Jun 20, 2012 at 3:53 PM, Minchan Kim <minchan [at] kernel> wrote:
>> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:
>>
>>> (6/20/12 2:12 AM), Minchan Kim wrote:
>>>>
>>>> Hi Aaditya,
>>>>
>>>> I want to discuss this problem on another thread.
>>>>
>>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>>>
>>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>>>>>>> pgdat_balanced()
>>>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>>>>> normal zone has no reclaimable page.
>>>>>>>>>
>>>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>>>>> _and_ 25% of present pages in node are free.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry. I can't understand your point.
>>>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>>>>> It seems I am missing your point.
>>>>>>>> Please anybody correct me.
>>>>>>>
>>>>>>> Since currently direct reclaim is given up based on
>>>>>>> zone->all_unreclaimable flag,
>>>>>>> so for e.g in one of the scenarios:
>>>>>>>
>>>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>>>>
>>>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>>>>>> looping in direct reclaim path for ever.
>>>>>>>
>>>>>>> This is so because when most of the pages in the MOVABLE zone have
>>>>>>> been migrated,
>>>>>>> the zone now contains lots of free memory (basically above low watermark)
>>>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>>>>
>>>>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>>>>> never be set for this zone
>>>>>>> and allocation function would end up looping forever. (assuming the
>>>>>>> zone NORMAL is
>>>>>>> left with no reclaimable memory)
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>>>>> But I don't see it's a problem of kswapd.
>>>>>
>>>>> Hi Kim,
>>>>
>>>> I like called Minchan rather than Kim
>>>> Never mind. :)
>>>>
>>>>>
>>>>> Yes I agree it is not a problem of kswapd.
>>>>
>>>> Yeb.
>>>>
>>>>>
>>>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>>>>> but we can't allocate it. :(
>>>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>>>>> Kswapd is just one of them confused.
>>>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>>>>
>>>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>>>>
>>>>>
>>>>> I assume that by the inconsistency you mention above, you mean
>>>>> temporary inconsistency.
>>>>>
>>>>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>>>>> is that the inconsistency you mentioned above would NOT be a temporary
>>>>> inconsistency.
>>>>>
>>>>> Every time say 'x' number of page frames are off lined, they will
>>>>> introduce a difference of 'x' pages between
>>>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>>>>> (So for e.g. if we do a frequent offline/online it will make
>>>>> NR_FREE_PAGES negative)
>>>>>
>>>>> This is so because, unset_migratetype_isolate() is called from
>>>>> offlining code (to set the migrate type of off lined pages again back
>>>>> to MIGRATE_MOVABLE)
>>>>> after the pages have been off lined and removed from the buddy list.
>>>>> Since the pages for which unset_migratetype_isolate() is called are
>>>>> not buddy pages so move_freepages_block() does not move any page, and
>>>>> thus introducing a permanent inconsistency.
>>>>
>>>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
>>>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>>>>
>>>>>
>>>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>>>>> free_area[order].nr_free exactly.
>>>>>>
>>>>>> Any thought?
>>>>>
>>>>> As for fixing move_freepages_block(), At least for memory hot plug,
>>>>> the pages stay in MIGRATE_ISOLATE list only for duration
>>>>> offline_pages() function,
>>>>> I mean only temporarily. Since fixing move_freepages_block() for will
>>>>> introduce some overhead, So I am not very sure whether that overhead
>>>>> is justified
>>>>> for a temporary condition. What do you think?
>>>>
>>>> Yes. I don't like hurt fast path, either.
>>>> How about this? (Passed just compile test :( )
>>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>>>>
>>>> This patch hurts high order page free path but I think it's not critical because higher order allocation
>>>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
>>>> which is more hot.
>>>
>>> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event.
>>
>>
>> +1
>>
>> Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by
>> several places. As I looked over places, I can't find critical places except kswapd forever sleep case.
>> So it's a nice idea!
>>
>> In that case, we need zone->lock whenever zone_watermark_ok_safe is called.
>>
> Ifdefinery could be utilized for builds with CMA disabled, first.


Yep.
If the system uses neither CMA nor MEMORY_HOTPLUG, we can avoid it.
I can do that in a formal patch later.

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG)

Still, my concern is that I'm not sure whether this approach is a good one.
As I mentioned, MIGRATE_ISOLATE is a very ironic type: it marks pages that are NOT allocatable,
yet they sit in the _free_ list and even increase NR_FREE_PAGES and the nr_free of free_area.
So if someone uses it in the future, we have to add a new CONFIG like this:

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_XXX)

And if yet another user comes along:

#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_XXX) || defined(CONFIG_YYY)

I know we can guess that MIGRATE_ISOLATE will have few users. But who can guarantee it?
The important thing is that we provide such an interface, and any user who wants it can use it at any time.

If the users of NR_FREE_PAGES and nr_free increase, they will get even more confused than now. Sigh.
IMHO, the right approach is not to account such pages as free in the first place,
although that adds a new condition check in the higher-order free page path.
Look at the patch attached to my previous mail.

Of course, we could redesign the isolated page machinery of hotplug, but I have no good idea for that.
We could move pages that are already in the freelist to a separate list outside free_area.
But the problem is that we can't isolate pages that are freed later, i.e. returned to the buddy allocator
after we mark the pageblock MIGRATE_ISOLATE. For that, we would have to touch the page free path, which is a hot path, too.
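
A rough userspace model of the "don't account it as free in the first place" idea, just to show where the extra check would sit. This is not the attached patch; the enum, struct and function names are invented for the sketch.

```c
#include <assert.h>

#define MAX_ORDER 11

/* Hypothetical, reduced set of migrate types for the model. */
enum mt_model { MT_MOVABLE, MT_ISOLATE, MT_TYPES };

/* Hypothetical stand-in for the zone counters discussed in this thread. */
struct zone_acct_model {
	long nr_free_pages;                    /* NR_FREE_PAGES stand-in */
	long nr_free[MAX_ORDER];               /* free_area[order].nr_free stand-in */
	long list_len[MAX_ORDER][MT_TYPES];    /* entries on each free list */
};

/* Model of freeing one block of 2^order pages into the buddy lists. */
static void free_one_page_model(struct zone_acct_model *z, int order,
				enum mt_model mt)
{
	assert(order >= 0 && order < MAX_ORDER);

	/* The page always goes onto the free list of its migrate type. */
	z->list_len[order][mt]++;

	/* The extra check: isolated pages are not counted as free. */
	if (mt == MT_ISOLATE)
		return;

	z->nr_free[order]++;
	z->nr_free_pages += 1L << order;
}
```

With this, the counters only ever describe pages the allocator can hand out, at the cost of one more branch in the free path.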


>
>> Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high.
>> But if memory pressure is high, it may be already meaningless alloc/free performance.
>> So it does make sense, IMHO.
>>
>> Please raise your hands if anyone has a concern about this.
>>
>> barrios [at] bbo:~/linux-next$ git diff
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..82cc0a2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>>                                        zone_page_state(z, NR_FREE_PAGES));
>>  }
>>
>> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
>>                      int classzone_idx, int alloc_flags)
>>  {
>> +       struct free_area *area;
>> +       struct list_head *curr;
>> +       int order;
>> +       unsigned long flags;
>>        long free_pages = zone_page_state(z, NR_FREE_PAGES);
>>
>>        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>>                free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> -       return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> -                                                               free_pages);
>> +       /*
>> +        * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE
>> +        * so that buddy can't allocate it although they are in free list.
>> +        */
>> +       spin_lock_irqsave(&z->lock, flags);
>> +       for (order = 0; order < MAX_ORDER; order++) {
>> +               int count = 0;
>> +               area = &(z->free_area[order]);
>> +               if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
>> +                       list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> +                               count++;
>> +                       free_pages -= (count << order);
>> +               }
>> +       }
>> +       if (free_pages < 0)
>> +               free_pages = 0;
>> +       spin_unlock_irqrestore(&z->lock, flags);
>> +
>> +       return __zone_watermark_ok(z, alloc_order, mark,
>> +                               classzone_idx, alloc_flags, free_pages);
>>  }
>>
> Then isolated pages could be scanned in another direction?
>
> spin_lock_irqsave(&z->lock, flags);
> for (order = MAX_ORDER - 1; order >= 0; order--) {
>         struct free_area *area = &z->free_area[order];
>         long count = 0;
>         struct list_head *curr;
>
>         list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>                 count++;
>
>         free_pages -= (count << order);
>         if (free_pages < 0) {
>                 free_pages = 0;
>                 break;
>         }
> }
> spin_unlock_irqrestore(&z->lock, flags);

I'm not sure how it helps reduce the loop iterations, but it's no problem.
Anyway, before sending a formal patch, I want to discuss my concern above.
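
For anyone who wants to play with the scan direction, here is a small userspace model of the top-down loop. It is only a sketch: the function name and the visited counter are hypothetical, and the per-order counts stand in for walking the real free lists. Note that the early exit only triggers once the running total drops below zero.

```c
#include <assert.h>

#define MAX_ORDER 11

/*
 * isolated[order] is the number of free entries of that order on the
 * isolate list. *visited reports how many orders were examined.
 */
static long subtract_isolated(long free_pages, const long isolated[MAX_ORDER],
			      int *visited)
{
	int order;

	assert(visited != 0);
	*visited = 0;

	/* Highest order first: each entry there removes the most pages. */
	for (order = MAX_ORDER - 1; order >= 0; order--) {
		(*visited)++;
		free_pages -= isolated[order] << order;
		if (free_pages < 0)
			return 0;	/* clamp and stop scanning */
	}
	return free_pages;
}
```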

Thanks, Hillf.

--
Kind regards,
Minchan Kim


minchan at kernel

Jun 20, 2012, 5:01 PM

Post #7 of 30 (611 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On 06/21/2012 05:19 AM, KOSAKI Motohiro wrote:

> (6/20/12 3:53 AM), Minchan Kim wrote:
>> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:
>>
>>> (6/20/12 2:12 AM), Minchan Kim wrote:
>>>>
>>>> Hi Aaditya,
>>>>
>>>> I want to discuss this problem on another thread.
>>>>
>>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>>>
>>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>>>>>>> pgdat_balanced()
>>>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>>>>> normal zone has no reclaimable page.
>>>>>>>>>
>>>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>>>>> _and_ 25% of present pages in node are free.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry. I can't understand your point.
>>>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>>>>> It seems I am missing your point.
>>>>>>>> Please anybody correct me.
>>>>>>>
>>>>>>> Since currently direct reclaim is given up based on
>>>>>>> zone->all_unreclaimable flag,
>>>>>>> so for e.g in one of the scenarios:
>>>>>>>
>>>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>>>>
>>>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>>>>>> looping in direct reclaim path for ever.
>>>>>>>
>>>>>>> This is so because when most of the pages in the MOVABLE zone have
>>>>>>> been migrated,
>>>>>>> the zone now contains lots of free memory (basically above low watermark)
>>>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>>>>
>>>>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>>>>> never be set for this zone
>>>>>>> and allocation function would end up looping forever. (assuming the
>>>>>>> zone NORMAL is
>>>>>>> left with no reclaimable memory)
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>>>>> But I don't see it's a problem of kswapd.
>>>>>
>>>>> Hi Kim,
>>>>
>>>> I like called Minchan rather than Kim
>>>> Never mind. :)
>>>>
>>>>>
>>>>> Yes I agree it is not a problem of kswapd.
>>>>
>>>> Yeb.
>>>>
>>>>>
>>>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>>>>> but we can't allocate it. :(
>>>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>>>>> Kswapd is just one of them confused.
>>>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>>>>
>>>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>>>>
>>>>>
>>>>> I assume that by the inconsistency you mention above, you mean
>>>>> temporary inconsistency.
>>>>>
>>>>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>>>>> is that the inconsistency you mentioned above would NOT be a temporary
>>>>> inconsistency.
>>>>>
>>>>> Every time say 'x' number of page frames are off lined, they will
>>>>> introduce a difference of 'x' pages between
>>>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>>>>> (So for e.g. if we do a frequent offline/online it will make
>>>>> NR_FREE_PAGES negative)
>>>>>
>>>>> This is so because, unset_migratetype_isolate() is called from
>>>>> offlining code (to set the migrate type of off lined pages again back
>>>>> to MIGRATE_MOVABLE)
>>>>> after the pages have been off lined and removed from the buddy list.
>>>>> Since the pages for which unset_migratetype_isolate() is called are
>>>>> not buddy pages so move_freepages_block() does not move any page, and
>>>>> thus introducing a permanent inconsistency.
>>>>
>>>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
>>>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>>>>
>>>>>
>>>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>>>>> free_area[order].nr_free exactly.
>>>>>>
>>>>>> Any thought?
>>>>>
>>>>> As for fixing move_freepages_block(), At least for memory hot plug,
>>>>> the pages stay in MIGRATE_ISOLATE list only for duration
>>>>> offline_pages() function,
>>>>> I mean only temporarily. Since fixing move_freepages_block() for will
>>>>> introduce some overhead, So I am not very sure whether that overhead
>>>>> is justified
>>>>> for a temporary condition. What do you think?
>>>>
>>>> Yes. I don't like hurt fast path, either.
>>>> How about this? (Passed just compile test :( )
>>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>>>>
>>>> This patch hurts high order page free path but I think it's not critical because higher order allocation
>>>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
>>>> which is more hot.
>>>
>>> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event.
>>
>>
>> +1
>>
>> Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by
>> several places. As I looked over places, I can't find critical places except kswapd forever sleep case.
>> So it's a nice idea!
>>
>> In that case, we need zone->lock whenever zone_watermark_ok_safe is called.
>> Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high.
>> But if memory pressure is high, it may be already meaningless alloc/free performance.
>> So it does make sense, IMHO.
>>
>> Please raise your hands if anyone has a concern about this.
>>
>> barrios [at] bbo:~/linux-next$ git diff
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..82cc0a2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>>                                        zone_page_state(z, NR_FREE_PAGES));
>>  }
>>
>> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
>>                      int classzone_idx, int alloc_flags)
>>  {
>> +       struct free_area *area;
>> +       struct list_head *curr;
>> +       int order;
>> +       unsigned long flags;
>>        long free_pages = zone_page_state(z, NR_FREE_PAGES);
>>
>>        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>>                free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> -       return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> -                                                               free_pages);
>> +       /*
>> +        * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE
>> +        * so that buddy can't allocate it although they are in free list.
>> +        */
>> +       spin_lock_irqsave(&z->lock, flags);
>> +       for (order = 0; order < MAX_ORDER; order++) {
>> +               int count = 0;
>> +               area = &(z->free_area[order]);
>> +               if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
>> +                       list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> +                               count++;
>> +                       free_pages -= (count << order);
>> +               }
>> +       }
>> +       if (free_pages < 0)
>> +               free_pages = 0;
>> +       spin_unlock_irqrestore(&z->lock, flags);
>> +
>> +       return __zone_watermark_ok(z, alloc_order, mark,
>> +                               classzone_idx, alloc_flags, free_pages);
>>  }
>
> The number of isolated pageblocks is almost always 0. So if we had such a counter,
> we could almost always avoid zone->lock. Just an idea.


Yep. I thought about it, but unfortunately we can't keep a counter for MIGRATE_ISOLATE,
because we would have to tweak the page free path to handle pages that are freed later,
after we mark the pageblock type as MIGRATE_ISOLATE.


--
Kind regards,
Minchan Kim


minchan at kernel

Jun 20, 2012, 5:01 PM

Post #8 of 30 (607 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On 06/21/2012 05:19 AM, KOSAKI Motohiro wrote:

> (6/20/12 3:53 AM), Minchan Kim wrote:
>> On 06/20/2012 03:32 PM, KOSAKI Motohiro wrote:
>>
>>> (6/20/12 2:12 AM), Minchan Kim wrote:
>>>>
>>>> Hi Aaditya,
>>>>
>>>> I want to discuss this problem on another thread.
>>>>
>>>> On 06/19/2012 10:18 PM, Aaditya Kumar wrote:
>>>>> On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>>>>>
>>>>>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>>>>>>> pgdat_balanced()
>>>>>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>>>>>> normal zone has no reclaimable page.
>>>>>>>>>
>>>>>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>>>>>> sleep only if every zones have much free pages than high water mark
>>>>>>>>> _and_ 25% of present pages in node are free.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry. I can't understand your point.
>>>>>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>>>>>> It seems I am missing your point.
>>>>>>>> Please anybody correct me.
>>>>>>>
>>>>>>> Since currently direct reclaim is given up based on
>>>>>>> zone->all_unreclaimable flag,
>>>>>>> so for e.g in one of the scenarios:
>>>>>>>
>>>>>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>>>>>> hot-remove the all the pages of the MOVABLE zone.
>>>>>>>
>>>>>>> While migrating pages during memory hot-unplugging, the allocation function
>>>>>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>>>>>> looping in direct reclaim path for ever.
>>>>>>>
>>>>>>> This is so because when most of the pages in the MOVABLE zone have
>>>>>>> been migrated,
>>>>>>> the zone now contains lots of free memory (basically above low watermark)
>>>>>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>>>>>
>>>>>>> So kswapd() would not balance this zone as free pages are above low watermark
>>>>>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>>>>>> never be set for this zone
>>>>>>> and allocation function would end up looping forever. (assuming the
>>>>>>> zone NORMAL is
>>>>>>> left with no reclaimable memory)
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>>>>>> But I don't see it's a problem of kswapd.
>>>>>
>>>>> Hi Kim,
>>>>
>>>> I like called Minchan rather than Kim
>>>> Never mind. :)
>>>>
>>>>>
>>>>> Yes I agree it is not a problem of kswapd.
>>>>
>>>> Yeb.
>>>>
>>>>>
>>>>>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>>>>>> but we can't allocate it. :(
>>>>>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>>>>>> Kswapd is just one of them confused.
>>>>>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>>>>>
>>>>>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>>>>>
>>>>>
>>>>> I assume that by the inconsistency you mention above, you mean
>>>>> temporary inconsistency.
>>>>>
>>>>> Sorry, but IMHO as for memory hot plug the main issue with this patch
>>>>> is that the inconsistency you mentioned above would NOT be a temporary
>>>>> inconsistency.
>>>>>
>>>>> Every time say 'x' number of page frames are off lined, they will
>>>>> introduce a difference of 'x' pages between
>>>>> NR_FREE_PAGES and SumOf[free_area[order].nr_free].
>>>>> (So for e.g. if we do a frequent offline/online it will make
>>>>> NR_FREE_PAGES negative)
>>>>>
>>>>> This is so because, unset_migratetype_isolate() is called from
>>>>> offlining code (to set the migrate type of off lined pages again back
>>>>> to MIGRATE_MOVABLE)
>>>>> after the pages have been off lined and removed from the buddy list.
>>>>> Since the pages for which unset_migratetype_isolate() is called are
>>>>> not buddy pages so move_freepages_block() does not move any page, and
>>>>> thus introducing a permanent inconsistency.
>>>>
>>>> Good point. Negative NR_FREE_PAGES is caused by double counting by my patch and __offline_isolated_pages.
>>>> I think at first MIGRATE_ISOLATE type freed page shouldn't account as free page.
>>>>
>>>>>
>>>>>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>>>>>> free_area[order].nr_free exactly.
>>>>>>
>>>>>> Any thought?
>>>>>
>>>>> As for fixing move_freepages_block(), At least for memory hot plug,
>>>>> the pages stay in MIGRATE_ISOLATE list only for duration
>>>>> offline_pages() function,
>>>>> I mean only temporarily. Since fixing move_freepages_block() for will
>>>>> introduce some overhead, So I am not very sure whether that overhead
>>>>> is justified
>>>>> for a temporary condition. What do you think?
>>>>
>>>> Yes. I don't like hurt fast path, either.
>>>> How about this? (Passed just compile test :( )
>>>> The patch's goal is to NOT increase nr_free and NR_FREE_PAGES about freed page into MIGRATE_ISOLATED.
>>>>
>>>> This patch hurts high order page free path but I think it's not critical because higher order allocation
>>>> is rare than order-0 allocation and we already have done same thing on free_hot_cold_page on order-0 free path
>>>> which is more hot.
>>>
>>> Can't we change zone_water_mark_ok_safe() instead of page allocator? memory hotplug is really rare event.
>>
>>
>> +1
>>
>> Firstly, I want to make zone_page_state(z, NR_FREE_PAGES) itself more accurately because it is used by
>> several places. As I looked over places, I can't find critical places except kswapd forever sleep case.
>> So it's a nice idea!
>>
>> In that case, we need zone->lock whenever zone_watermark_ok_safe is called.
>> Most of cases, it's unnecessary and it might hurt alloc/free performance when memory pressure is high.
>> But if memory pressure is high, it may be already meaningless alloc/free performance.
>> So it does make sense, IMHO.
>>
>> Please raise your hands if anyone has a concern about this.
>>
>> barrios [at] bbo:~/linux-next$ git diff
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..82cc0a2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1748,16 +1748,38 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>> zone_page_state(z, NR_FREE_PAGES));
>> }
>>
>> -bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>> +bool zone_watermark_ok_safe(struct zone *z, int alloc_order, unsigned long mark,
>> int classzone_idx, int alloc_flags)
>> {
>> + struct free_area *area;
>> + struct list_head *curr;
>> + int order;
>> + unsigned long flags;
>> long free_pages = zone_page_state(z, NR_FREE_PAGES);
>>
>> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> - return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> - free_pages);
>> + /*
>> + * Memory hotplug/CMA can isolate freed page into MIGRATE_ISOLATE
>> + * so that buddy can't allocate it although they are in free list.
>> + */
>> + spin_lock_irqsave(&z->lock, flags);
>> + for (order = 0; order < MAX_ORDER; order++) {
>> + int count = 0;
>> + area = &(z->free_area[order]);
>> + if (unlikely(!list_empty(&area->free_list[MIGRATE_ISOLATE]))) {
>> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> + count++;
>> + free_pages -= (count << order);
>> + }
>> + }
>> + if (free_pages < 0)
>> + free_pages = 0;
>> + spin_unlock_irqrestore(&z->lock, flags);
>> +
>> + return __zone_watermark_ok(z, alloc_order, mark,
>> + classzone_idx, alloc_flags, free_pages);
>> }
>
> number of isolate page block is almost always 0. then if we have such counter,
> we almost always can avoid zone->lock. Just idea.


Yeb. I thought about it, but unfortunately we can't keep a counter for MIGRATE_ISOLATE,
because we would have to tweak the page free path to account for pages that are freed later,
after we mark the pageblock type as MIGRATE_ISOLATE.


--
Kind regards,
Minchan Kim
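
The arithmetic in the quoted patch can be tried out in userspace. The following is a minimal sketch, not kernel code: it assumes MAX_ORDER is 11, and the function name usable_free_pages and the isolated_blocks array are hypothetical stand-ins for walking free_list[MIGRATE_ISOLATE] at each order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 11 /* assumption: a common default, not taken from the thread */

/*
 * Hypothetical userspace model: isolated_blocks[order] is the number of
 * free blocks of that order sitting on the MIGRATE_ISOLATE free list.
 */
static long usable_free_pages(long nr_free_pages,
                              const unsigned long isolated_blocks[MAX_ORDER])
{
    long free_pages = nr_free_pages;
    int order;

    for (order = 0; order < MAX_ORDER; order++) {
        /* One free block of a given order covers 2^order pages. */
        free_pages -= (long)(isolated_blocks[order] << order);
    }

    /* The counter can be stale, so never report a negative value. */
    if (free_pages < 0)
        free_pages = 0;
    return free_pages;
}
```

With 2000 pages reported free and isolated blocks of three order-0, two order-2 and one order-10, the usable count is 2000 - 3 - 8 - 1024 = 965 pages.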



kosaki.motohiro at gmail

Jun 20, 2012, 6:39 PM

Post #9 of 30 (610 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

>> number of isolate page block is almost always 0. then if we have such counter,
>> we almost always can avoid zone->lock. Just idea.
>
> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
> Because we have to tweak in page free path for pages which are going to free later after we
> mark pageblock type to MIGRATE_ISOLATE.

I mean,

if (nr_isolate_pageblock != 0)
free_pages -= nr_isolated_free_pages(); // your counting logic

return __zone_watermark_ok(z, alloc_order, mark,
classzone_idx, alloc_flags, free_pages);


I don't think this logic affects your race. zone_watermark_ok() is already
racy, so one more small race is no big matter.


minchan at kernel

Jun 20, 2012, 6:55 PM

Post #10 of 30 (610 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:

>>> number of isolate page block is almost always 0. then if we have such counter,
>>> we almost always can avoid zone->lock. Just idea.
>>
>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>> Because we have to tweak in page free path for pages which are going to free later after we
>> mark pageblock type to MIGRATE_ISOLATE.
>
> I mean,
>
> if (nr_isolate_pageblock != 0)
> free_pages -= nr_isolated_free_pages(); // your counting logic
>
> return __zone_watermark_ok(z, alloc_order, mark,
> classzone_idx, alloc_flags, free_pages);
>
>
> I don't think this logic affect your race. zone_watermark_ok() is already
> racy. then new little race is no big matter.


It seems my explanation wasn't clear enough. :(
I understand your intention, but we can't maintain nr_isolate_pageblock,
because we would have to count two types of free pages.

1. Pages that are already free, so they are already in the buddy list.
Of course, we can count these easily with the return value of move_freepages_block(zone, page, MIGRATE_ISOLATE).

2. Pages that will be freed later by do_migrate_range.
This is the _PROBLEM_. To count them, we would have to tweak the free path. No?

If all of the pages are PageLRU when hot-unplug happens (i.e., case 2), nr_isolate_pageblock is zero and
zone_watermark_ok_safe can't do its job.


--
Kind regards,
Minchan Kim
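
The two cases above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct zone_model and every function here are hypothetical, and the model treats the counter as a count of isolated free pages filled in at isolation time, which is how this message reads the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one zone; none of these names are kernel API. */
struct zone_model {
    long nr_free_pages;        /* stands in for NR_FREE_PAGES */
    long on_isolate_list;      /* pages really on the MIGRATE_ISOLATE list */
    long counted_at_isolation; /* counter filled in when a block is isolated */
};

/* Case 1: pages already free at isolation time are moved and counted once. */
static void isolate_block(struct zone_model *z, long already_free)
{
    z->on_isolate_list += already_free;
    z->counted_at_isolation += already_free;
}

/* Case 2: a page of the isolated block is freed later, after migration. */
static void free_page_later(struct zone_model *z, bool tweak_free_path)
{
    z->nr_free_pages++;
    z->on_isolate_list++;
    if (tweak_free_path)
        z->counted_at_isolation++; /* only happens if the free path is changed */
}

/* Estimate based on the isolation-time counter. */
static long usable_by_counter(const struct zone_model *z)
{
    return z->nr_free_pages - z->counted_at_isolation;
}

/* Estimate based on walking the isolate list at check time. */
static long usable_by_list_walk(const struct zone_model *z)
{
    return z->nr_free_pages - z->on_isolate_list;
}
```

If a block is isolated while all of its pages are still in use and 512 pages are freed afterwards without touching the free path, the counter-based estimate includes all 512 unallocatable pages, while the list walk does not.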


kosaki.motohiro at gmail

Jun 20, 2012, 7:45 PM

Post #11 of 30 (607 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>
>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>> we almost always can avoid zone->lock. Just idea.
>>>
>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>> Because we have to tweak in page free path for pages which are going to free later after we
>>> mark pageblock type to MIGRATE_ISOLATE.
>>
>> I mean,
>>
>> if (nr_isolate_pageblock != 0)
>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>
>> return __zone_watermark_ok(z, alloc_order, mark,
>> classzone_idx, alloc_flags, free_pages);
>>
>>
>> I don't think this logic affect your race. zone_watermark_ok() is already
>> racy. then new little race is no big matter.
>
>
> It seems my explanation wasn't enough. :(
> I already understand your intention but we can't make nr_isolate_pageblock.
> Because we should count two type of free pages.

I mean, at the move_freepages_block call we increment the number of page *blocks*, not pages.
The number of free *pages* is counted by zone_watermark_ok_safe().


> 1. Already freed page so they are already in buddy list.
> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>
> 2. Will be FREEed page by do_migrate_range.
> It's a _PROBLEM_. For it, we should tweak free path. No?

No.


> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
> zone_watermk_ok_safe can't do his role.

The number of isolated pageblocks doesn't depend on the number of free pages.
It is an attribute of a PFN range.


minchan at kernel

Jun 20, 2012, 9:55 PM

Post #12 of 30 (600 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:

> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>
>>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>>> we almost always can avoid zone->lock. Just idea.
>>>>
>>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>>> Because we have to tweak in page free path for pages which are going to free later after we
>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>
>>> I mean,
>>>
>>> if (nr_isolate_pageblock != 0)
>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>
>>> return __zone_watermark_ok(z, alloc_order, mark,
>>> classzone_idx, alloc_flags, free_pages);
>>>
>>>
>>> I don't think this logic affect your race. zone_watermark_ok() is already
>>> racy. then new little race is no big matter.
>>
>>
>> It seems my explanation wasn't enough. :(
>> I already understand your intention but we can't make nr_isolate_pageblock.
>> Because we should count two type of free pages.
>
> I mean, move_freepages_block increment number of page *block*, not pages.
> number of free *pages* are counted by zone_watermark_ok_safe().
>
>
>> 1. Already freed page so they are already in buddy list.
>> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>
>> 2. Will be FREEed page by do_migrate_range.
>> It's a _PROBLEM_. For it, we should tweak free path. No?
>
> No.
>
>
>> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
>> zone_watermk_ok_safe can't do his role.
>
> number of isolate pageblock don't depend on number of free pages. It's
> a concept of
> an attribute of PFN range.


It seems you mean is_migrate_isolate as just a flag, NOT nr_isolate_pageblock.
So do you mean this?

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3bdcab3..7f4d19c 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -1,6 +1,7 @@
#ifndef __LINUX_PAGEISOLATION_H
#define __LINUX_PAGEISOLATION_H

+extern bool is_migrate_isolate;
/*
* Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
* If specified range includes migrate types other than MOVABLE or CMA,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2a515d..b997cb3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

+#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_HOTPLUG)
+ if (unlikely(is_migrate_isolate)) {
+ unsigned long flags;
+ spin_lock_irqsave(&z->lock, flags);
+ for (order = MAX_ORDER - 1; order >= 0; order--) {
+ struct free_area *area = &z->free_area[order];
+ long count = 0;
+ struct list_head *curr;
+
+ list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
+ count++;
+
+ free_pages -= (count << order);
+ if (free_pages < 0) {
+ free_pages = 0;
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&z->lock, flags);
+ }
+#endif
return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
free_pages);
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c9f0477..212e526 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
return pfn_to_page(pfn + i);
}

+bool is_migrate_isolate = false;
+
/*
* start_isolate_page_range() -- make page-allocation-type of range of pages
* to be MIGRATE_ISOLATE.
@@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
BUG_ON((end_pfn) & (pageblock_nr_pages - 1));

+ is_migrate_isolate = true;
+
for (pfn = start_pfn;
pfn < end_pfn;
pfn += pageblock_nr_pages) {
@@ -59,6 +63,7 @@ undo:
pfn += pageblock_nr_pages)
unset_migratetype_isolate(pfn_to_page(pfn), migratetype);

+ is_migrate_isolate = false;
return -EBUSY;
}

@@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
continue;
unset_migratetype_isolate(page, migratetype);
}
+
+ is_migrate_isolate = false;
+
return 0;
}
/*

It is still racy, as you already mentioned, and I don't think the race is trivial.
With the current fragile zone->all_unreclaimable logic, direct reclaim can never wake up kswapd,
so it's a livelock.
Then, do you want to fix this problem with your patch[1]?

Applying your patch[1] could resolve the livelock through an OOM kill, but kswapd still wouldn't
be woken up, although that's not critical. Okay. Then please describe this problem in detail
in your patch's changelog and resend.

[1] http://lkml.org/lkml/2012/6/14/74

--
Kind regards,
Minchan Kim


kamezawa.hiroyu at jp

Jun 21, 2012, 3:52 AM

Post #13 of 30 (604 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

(2012/06/21 13:55), Minchan Kim wrote:
> On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:
>
>> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
>>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>>
>>>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>>>> we almost always can avoid zone->lock. Just idea.
>>>>>
>>>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>>>> Because we have to tweak in page free path for pages which are going to free later after we
>>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>>
>>>> I mean,
>>>>
>>>> if (nr_isolate_pageblock != 0)
>>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>>
>>>> return __zone_watermark_ok(z, alloc_order, mark,
>>>> classzone_idx, alloc_flags, free_pages);
>>>>
>>>>
>>>> I don't think this logic affect your race. zone_watermark_ok() is already
>>>> racy. then new little race is no big matter.
>>>
>>>
>>> It seems my explanation wasn't enough. :(
>>> I already understand your intention but we can't make nr_isolate_pageblock.
>>> Because we should count two type of free pages.
>>
>> I mean, move_freepages_block increment number of page *block*, not pages.
>> number of free *pages* are counted by zone_watermark_ok_safe().
>>
>>
>>> 1. Already freed page so they are already in buddy list.
>>> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>>
>>> 2. Will be FREEed page by do_migrate_range.
>>> It's a _PROBLEM_. For it, we should tweak free path. No?
>>
>> No.
>>
>>
>>> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
>>> zone_watermk_ok_safe can't do his role.
>>
>> number of isolate pageblock don't depend on number of free pages. It's
>> a concept of
>> an attribute of PFN range.
>
>
> It seems you mean is_migrate_isolate as a just flag, NOT nr_isolate_pageblock.
> So do you mean this?
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3bdcab3..7f4d19c 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -1,6 +1,7 @@
> #ifndef __LINUX_PAGEISOLATION_H
> #define __LINUX_PAGEISOLATION_H
>
> +extern bool is_migrate_isolate;
> /*
> * Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> * If specified range includes migrate types other than MOVABLE or CMA,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2a515d..b997cb3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
> + if (unlikely(is_migrate_isolate)) {
> + unsigned long flags;
> + spin_lock_irqsave(&z->lock, flags);
> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> + struct free_area *area = &z->free_area[order];
> + long count = 0;
> + struct list_head *curr;
> +
> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> + count++;
> +
> + free_pages -= (count << order);
> + if (free_pages < 0) {
> + free_pages = 0;
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&z->lock, flags);
> + }
> +#endif
> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> free_pages);
> }
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c9f0477..212e526 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> return pfn_to_page(pfn + i);
> }
>
> +bool is_migrate_isolate = false;
> +
> /*
> * start_isolate_page_range() -- make page-allocation-type of range of pages
> * to be MIGRATE_ISOLATE.
> @@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
> BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
>
> + is_migrate_isolate = true;
> +
> for (pfn = start_pfn;
> pfn < end_pfn;
> pfn += pageblock_nr_pages) {
> @@ -59,6 +63,7 @@ undo:
> pfn += pageblock_nr_pages)
> unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
>
> + is_migrate_isolate = false;
> return -EBUSY;
> }
>
> @@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> continue;
> unset_migratetype_isolate(page, migratetype);
> }
> +
> + is_migrate_isolate = false;
> +
> return 0;
> }
> /*
>
> It is still racy as you already mentioned and I don't think it's trivial.
> Direct reclaim can't wake up kswapd forever by current fragile zone->all_unreclaimable.
> So it's a livelock.
> Then, do you want to fix this problem by your patch[1]?
>
> It could solve the livelock by OOM kill if we apply your patch[1] but still doesn't wake up
> kswapd although it's not critical. Okay. Then, please write down this problem in detail
> in your patch's changelog and resend, please.
>

Hm. I'm sorry if I couldn't follow the discussion... Can I make a summary?

As you showed, it seems not to be difficult to count free pages under MIGRATE_ISOLATE.
And we can tell whether the zone contains a MIGRATE_ISOLATE area or not with a simple check.
For example:
==
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
move_freepages_block(zone, page, MIGRATE_ISOLATE);
zone->nr_isolated_areas++;
=

Then, the solution will be adding a function like following
=
u64 zone_nr_free_pages(struct zone *zone) {
unsigned long free_pages, isolated;

free_pages = zone_page_state(zone, NR_FREE_PAGES);
if (unlikely(zone->nr_isolated_areas)) {
isolated = count_migrate_isolated_pages(zone);
free_pages -= isolated;
}
return free_pages;
}
=

Right? And... is zone->all_unreclaimable a different problem?

Thanks,
-Kame
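
The summary above can be modelled in userspace to show the intended effect: the list walk runs only while at least one area is isolated. This is a sketch under assumptions, not the proposed kernel code. struct zone_sketch, the walks field and the clamp against unsigned wrap-around are illustrative additions, MAX_ORDER is assumed to be 11, and count_migrate_isolated_pages is the helper named in the message, whose real implementation was not posted.

```c
#include <assert.h>

#define MAX_ORDER 11 /* assumption: a common default, not taken from the thread */

/* Hypothetical userspace stand-in for the fields the summary relies on. */
struct zone_sketch {
    unsigned long nr_free_pages;              /* stands in for NR_FREE_PAGES */
    unsigned long nr_isolated_areas;          /* bumped when a pageblock is isolated */
    unsigned long isolated_blocks[MAX_ORDER]; /* free blocks on the isolate list */
    unsigned long walks;                      /* how often the slow path ran */
};

/* Slow path: in the kernel this walk would need zone->lock. */
static unsigned long count_migrate_isolated_pages(struct zone_sketch *zone)
{
    unsigned long pages = 0;
    int order;

    zone->walks++;
    for (order = 0; order < MAX_ORDER; order++)
        pages += zone->isolated_blocks[order] << order;
    return pages;
}

static unsigned long zone_nr_free_pages(struct zone_sketch *zone)
{
    unsigned long free_pages = zone->nr_free_pages;

    /* Fast path: no isolated area, so no walk and no lock. */
    if (zone->nr_isolated_areas) {
        unsigned long isolated = count_migrate_isolated_pages(zone);

        /* Clamp so the unsigned subtraction cannot wrap around. */
        free_pages = isolated < free_pages ? free_pages - isolated : 0;
    }
    return free_pages;
}
```

In this model a zone with 1000 free pages and no isolated areas returns 1000 without walking any list. Once one order-9 block is isolated, it returns 488.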














aaditya.kumar.30 at gmail

Jun 21, 2012, 4:02 AM

Post #14 of 30 (613 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On Thu, Jun 21, 2012 at 10:25 AM, Minchan Kim <minchan [at] kernel> wrote:
> On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:
>
>> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
>>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>>
>>>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>>>> we almost always can avoid zone->lock. Just idea.
>>>>>
>>>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>>>> Because we have to tweak in page free path for pages which are going to free later after we
>>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>>
>>>> I mean,
>>>>
>>>> if (nr_isolate_pageblock != 0)
>>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>>
>>>> return __zone_watermark_ok(z, alloc_order, mark,
>>>> classzone_idx, alloc_flags, free_pages);
>>>>
>>>>
>>>> I don't think this logic affect your race. zone_watermark_ok() is already
>>>> racy. then new little race is no big matter.
>>>
>>>
>>> It seems my explanation wasn't enough. :(
>>> I already understand your intention but we can't make nr_isolate_pageblock.
>>> Because we should count two type of free pages.
>>
>> I mean, move_freepages_block increment number of page *block*, not pages.
>> number of free *pages* are counted by zone_watermark_ok_safe().
>>
>>
>>> 1. Already freed page so they are already in buddy list.
>>> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>>
>>> 2. Will be FREEed page by do_migrate_range.
>>> It's a _PROBLEM_. For it, we should tweak free path. No?
>>
>> No.
>>
>>
>>> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
>>> zone_watermk_ok_safe can't do his role.
>>
>> number of isolate pageblock don't depend on number of free pages. It's
>> a concept of
>> an attribute of PFN range.
>
>
> It seems you mean is_migrate_isolate as a just flag, NOT nr_isolate_pageblock.
> So do you mean this?
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3bdcab3..7f4d19c 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -1,6 +1,7 @@
> #ifndef __LINUX_PAGEISOLATION_H
> #define __LINUX_PAGEISOLATION_H
>
> +extern bool is_migrate_isolate;
> /*
> * Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> * If specified range includes migrate types other than MOVABLE or CMA,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2a515d..b997cb3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
> + if (unlikely(is_migrate_isolate)) {
> + unsigned long flags;
> + spin_lock_irqsave(&z->lock, flags);
> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> + struct free_area *area = &z->free_area[order];
> + long count = 0;
> + struct list_head *curr;
> +
> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> + count++;
> +
> + free_pages -= (count << order);
> + if (free_pages < 0) {
> + free_pages = 0;
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&z->lock, flags);
> + }
> +#endif
> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> free_pages);
> }
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c9f0477..212e526 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> return pfn_to_page(pfn + i);
> }
>
> +bool is_migrate_isolate = false;
> +
> /*
> * start_isolate_page_range() -- make page-allocation-type of range of pages
> * to be MIGRATE_ISOLATE.
> @@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
> BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
>
> + is_migrate_isolate = true;
> +
> for (pfn = start_pfn;
> pfn < end_pfn;
> pfn += pageblock_nr_pages) {
> @@ -59,6 +63,7 @@ undo:
> pfn += pageblock_nr_pages)
> unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
>
> + is_migrate_isolate = false;
> return -EBUSY;
> }
>
> @@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> continue;
> unset_migratetype_isolate(page, migratetype);
> }
> +
> + is_migrate_isolate = false;
> +
> return 0;
> }
> /*
>

Hello Minchan,

Sorry for the delayed response.

Instead of the above, how about something like this:

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3bdcab3..fe9215f 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,4 +34,6 @@ extern int set_migratetype_isolate(struct page *page);
extern void unset_migratetype_isolate(struct page *page, unsigned migratetype);


+extern atomic_t is_migrate_isolated;
+
#endif
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab1e714..e076fa2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,7 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
* Isolate the page, so that it doesn't get reallocated if it
* was free.
*/
+ atomic_inc(&is_migrate_isolated);
set_migratetype_isolate(p);
/*
* When the target page is a free hugepage, just remove it
@@ -1406,6 +1407,7 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
}
unset_migratetype_isolate(p, MIGRATE_MOVABLE);
unlock_memory_hotplug();
+ atomic_dec(&is_migrate_isolated);
return ret;
}

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d7e3ec..cd7805c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -892,6 +892,7 @@ static int __ref offline_pages(unsigned long start_pfn,
nr_pages = end_pfn - start_pfn;

/* set above range as isolated */
+ atomic_inc(&is_migrate_isolated);
ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
if (ret)
goto out;
@@ -958,6 +959,7 @@ repeat:
offline_isolated_pages(start_pfn, end_pfn);
/* reset pagetype flags and makes migrate type to be MOVABLE */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+ atomic_dec(&is_migrate_isolated);
/* removal success */
zone->present_pages -= offlined_pages;
zone->zone_pgdat->node_present_pages -= offlined_pages;
@@ -986,6 +988,7 @@ failed_removal:
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);

out:
+ atomic_dec(&is_migrate_isolated);
unlock_memory_hotplug();
return ret;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..f549361 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1632,6 +1632,28 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

+#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
+ if (unlikely(atomic_read(&is_migrate_isolated))) {
+ unsigned long flags;
+ spin_lock_irqsave(&z->lock, flags);
+ for (order = MAX_ORDER - 1; order >= 0; order--) {
+ struct free_area *area = &z->free_area[order];
+ long count = 0;
+ struct list_head *curr;
+
+ list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
+ count++;
+
+ free_pages -= (count << order);
+ if (free_pages < 0) {
+ free_pages = 0;
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&z->lock, flags);
+ }
+#endif
+
return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
free_pages);
}
@@ -5785,6 +5807,7 @@ int alloc_contig_range(unsigned long start,
unsigned long end,
* put back to page allocator so that buddy can use them.
*/

+ atomic_inc(&is_migrate_isolated);
ret = start_isolate_page_range(pfn_max_align_down(start),
pfn_max_align_up(end), migratetype);
if (ret)
@@ -5854,6 +5877,7 @@ int alloc_contig_range(unsigned long start,
unsigned long end,
done:
undo_isolate_page_range(pfn_max_align_down(start),
pfn_max_align_up(end), migratetype);
+ atomic_dec(&is_migrate_isolated);
return ret;
}

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c9f0477..e8eb241 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
return pfn_to_page(pfn + i);
}

+atomic_t is_migrate_isolated;
+
/*
* start_isolate_page_range() -- make page-allocation-type of range of pages
* to be MIGRATE_ISOLATE.


> It is still racy as you already mentioned and I don't think it's trivial.
> Direct reclaim can't wake up kswapd forever by current fragile zone->all_unreclaimable.
> So it's a livelock.
> Then, do you want to fix this problem by your patch[1]?
>
> It could solve the livelock by OOM kill if we apply your patch[1] but still doesn't wake up
> kswapd although it's not critical. Okay. Then, please write down this problem in detail
> in your patch's changelog and resend, please.
>
> [1] http://lkml.org/lkml/2012/6/14/74
>
> --
> Kind regards,
> Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kosaki.motohiro at gmail

Jun 21, 2012, 10:22 AM

Post #15 of 30 (601 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

>
> Hm. I'm sorry if I couldn't follow the discussion... Can I make a summary?
>
> As you have shown, it does not seem difficult to count free pages under
> MIGRATE_ISOLATE.
> And we can tell whether the zone contains a MIGRATE_ISOLATE area by a
> simple check.
> For example:
> ==
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> move_freepages_block(zone, page, MIGRATE_ISOLATE);
> zone->nr_isolated_areas++;
> =
>
> Then, the solution would be to add a function like the following:
> =
> u64 zone_nr_free_pages(struct zone *zone) {
> unsigned long free_pages, isolated;
>
> free_pages = zone_page_state(zone, NR_FREE_PAGES);
> if (unlikely(zone->nr_isolated_areas)) {
> isolated = count_migrate_isolated_pages(zone);
> free_pages -= isolated;
> }
> return free_pages;
> }
> =
>
> Right?

This represents my intention exactly. :)

> And... is zone->all_unreclaimable a different problem?

Yes, the livelock derived from all_unreclaimable doesn't depend on memory hotplug.


minchan at kernel

Jun 21, 2012, 6:05 PM

Post #16 of 30 (600 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

Hi Kame,

On 06/21/2012 07:52 PM, Kamezawa Hiroyuki wrote:

> (2012/06/21 13:55), Minchan Kim wrote:
>> On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:
>>
>>> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim<minchan [at] kernel> wrote:
>>>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>>>
>>>>>>> number of isolate page block is almost always 0. then if we have
>>>>>>> such counter,
>>>>>>> we almost always can avoid zone->lock. Just idea.
>>>>>>
>>>>>> Yeb. I thought about it but unfortunately we can't have a counter
>>>>>> for MIGRATE_ISOLATE.
>>>>>> Because we have to tweak in page free path for pages which are
>>>>>> going to free later after we
>>>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>>>
>>>>> I mean,
>>>>>
>>>>> if (nr_isolate_pageblock != 0)
>>>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>>>
>>>>> return __zone_watermark_ok(z, alloc_order, mark,
>>>>> classzone_idx, alloc_flags,
>>>>> free_pages);
>>>>>
>>>>>
>>>>> I don't think this logic affect your race. zone_watermark_ok() is
>>>>> already
>>>>> racy. then new little race is no big matter.
>>>>
>>>>
>>>> It seems my explanation wasn't enough. :(
>>>> I already understand your intention but we can't make
>>>> nr_isolate_pageblock.
>>>> Because we should count two type of free pages.
>>>
>>> I mean, move_freepages_block increment number of page *block*, not
>>> pages.
>>> number of free *pages* are counted by zone_watermark_ok_safe().
>>>
>>>
>>>> 1. Already freed page so they are already in buddy list.
>>>> Of course, we can count it with return value of
>>>> move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>>>
>>>> 2. Will be FREEed page by do_migrate_range.
>>>> It's a _PROBLEM_. For it, we should tweak free path. No?
>>>
>>> No.
>>>
>>>
>>>> If All of pages are PageLRU when hot-plug happens(ie, 2),
>>>> nr_isolate_pagblock is zero and
>>>> zone_watermk_ok_safe can't do his role.
>>>
>>> number of isolate pageblock don't depend on number of free pages. It's
>>> a concept of
>>> an attribute of PFN range.
>>
>>
>> It seems you mean is_migrate_isolate as a just flag, NOT
>> nr_isolate_pageblock.
>> So do you mean this?
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 3bdcab3..7f4d19c 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -1,6 +1,7 @@
>> #ifndef __LINUX_PAGEISOLATION_H
>> #define __LINUX_PAGEISOLATION_H
>>
>> +extern bool is_migrate_isolate;
>> /*
>> * Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> * If specified range includes migrate types other than MOVABLE or CMA,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..b997cb3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
>> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
>> + if (unlikely(is_migrate_isolate)) {
>> + unsigned long flags;
>> + spin_lock_irqsave(&z->lock, flags);
>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
>> + struct free_area *area = &z->free_area[order];
>> + long count = 0;
>> + struct list_head *curr;
>> +
>> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> + count++;
>> +
>> + free_pages -= (count << order);
>> + if (free_pages < 0) {
>> + free_pages = 0;
>> + break;
>> + }
>> + }
>> + spin_unlock_irqrestore(&z->lock, flags);
>> + }
>> +#endif
>> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> free_pages);
>> }
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index c9f0477..212e526 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> return pfn_to_page(pfn + i);
>> }
>>
>> +bool is_migrate_isolate = false;
>> +
>> /*
>> * start_isolate_page_range() -- make page-allocation-type of range of pages
>> * to be MIGRATE_ISOLATE.
>> @@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
>> BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
>>
>> + is_migrate_isolate = true;
>> +
>> for (pfn = start_pfn;
>> pfn < end_pfn;
>> pfn += pageblock_nr_pages) {
>> @@ -59,6 +63,7 @@ undo:
>> pfn += pageblock_nr_pages)
>> unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
>>
>> + is_migrate_isolate = false;
>> return -EBUSY;
>> }
>>
>> @@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> continue;
>> unset_migratetype_isolate(page, migratetype);
>> }
>> +
>> + is_migrate_isolate = false;
>> +
>> return 0;
>> }
>> /*
>>
>> It is still racy as you already mentioned and I don't think it's trivial.
>> Direct reclaim can't wake up kswapd forever by current fragile
>> zone->all_unreclaimable.
>> So it's a livelock.
>> Then, do you want to fix this problem by your patch[1]?
>>
>> It could solve the livelock by OOM kill if we apply your patch[1] but
>> still doesn't wake up
>> kswapd although it's not critical. Okay. Then, please write down this
>> problem in detail
>> in your patch's changelog and resend, please.
>>
>
> Hm. I'm sorry if I couldn't follow the discussion... Can I make a summary?
>
> As you have shown, it does not seem difficult to count free pages under
> MIGRATE_ISOLATE.
> And we can tell whether the zone contains a MIGRATE_ISOLATE area by a
> simple check.
> For example:
> ==
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> move_freepages_block(zone, page, MIGRATE_ISOLATE);
> zone->nr_isolated_areas++;
> =
>
> Then, the solution would be to add a function like the following:
> =
> u64 zone_nr_free_pages(struct zone *zone) {
> unsigned long free_pages, isolated;
>
> free_pages = zone_page_state(zone, NR_FREE_PAGES);
> if (unlikely(zone->nr_isolated_areas)) {
> isolated = count_migrate_isolated_pages(zone);
> free_pages -= isolated;
> }
> return free_pages;
> }
> =
>
> Right? And... is zone->all_unreclaimable a different problem?


Let me summarize again.

The problem:

When hotplug offlining happens on zone A, it starts to free pages as MIGRATE_ISOLATE type in buddy.
(MIGRATE_ISOLATE is a very ironic type because the pages are apparently in buddy but we can't allocate them.)
When a memory shortage happens during hotplug offlining, the current task starts to reclaim, then wakes up kswapd.
Kswapd checks the watermark, then goes to sleep BECAUSE the current zone_watermark_ok_safe doesn't consider
the MIGRATE_ISOLATE freed page count. The current task continues to reclaim in the direct reclaim path without kswapd's help.
The problem is that zone->all_unreclaimable is set only by kswapd, so the current task would loop forever
like below.

__alloc_pages_slowpath
restart:
wake_all_kswapd
rebalance:
__alloc_pages_direct_reclaim
do_try_to_free_pages
if global_reclaim && !all_unreclaimable
return 1; /* It means did_some_progress is nonzero */
skip __alloc_pages_may_oom
should_alloc_retry
goto rebalance;

If we apply KOSAKI's patch[1], which doesn't depend on kswapd for setting zone->all_unreclaimable,
we can solve this problem by killing some task. But it still doesn't wake up kswapd.
It could still be a problem if another subsystem needs a GFP_ATOMIC allocation.
So kswapd should consider MIGRATE_ISOLATE when it calculates free pages before going to sleep.
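
To make the accounting gap concrete, here is a minimal userspace sketch, not kernel code. It models a zone as per-order free-block counts split by migrate type; `struct toy_zone`, the `TOY_*` constants, and all three helper names are assumptions made up for illustration. It mirrors the counting loop in the patches quoted in this thread: pages on the isolated lists are subtracted from the free count, clamped at zero, before the watermark comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ORDER 11   /* models orders 0..10 */

enum toy_migratetype { TOY_MOVABLE, TOY_ISOLATE, TOY_NR_TYPES };

/* Hypothetical stand-in for struct zone: block counts per free list. */
struct toy_zone {
    long nr_free_blocks[TOY_MAX_ORDER][TOY_NR_TYPES];
    long high_wmark;       /* in pages */
};

/* Raw count: every free block is counted, including isolated ones. */
long toy_free_pages(const struct toy_zone *z)
{
    long pages = 0;

    for (int order = 0; order < TOY_MAX_ORDER; order++)
        for (int mt = 0; mt < TOY_NR_TYPES; mt++)
            pages += z->nr_free_blocks[order][mt] << order;
    return pages;
}

/* Pages that sit on the free lists but cannot be allocated. */
long toy_isolated_free_pages(const struct toy_zone *z)
{
    long pages = 0;

    for (int order = 0; order < TOY_MAX_ORDER; order++)
        pages += z->nr_free_blocks[order][TOY_ISOLATE] << order;
    return pages;
}

/* Watermark check; optionally discount isolated pages, clamping at zero. */
bool toy_watermark_ok(const struct toy_zone *z, bool discount_isolated)
{
    long free_pages;

    assert(z != NULL);
    free_pages = toy_free_pages(z);
    if (discount_isolated) {
        free_pages -= toy_isolated_free_pages(z);
        if (free_pages < 0)
            free_pages = 0;
    }
    return free_pages > z->high_wmark;
}
```

For example, with 200 free order-0 movable pages, four isolated order-10 blocks (4096 pages), and a high watermark of 1000 pages, the raw check passes while the discounted check fails. That is the situation in which kswapd goes back to sleep even though almost nothing is allocatable.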

First, I tried to solve this problem with this patch:
https://lkml.org/lkml/2012/6/20/30
The patch's goal was to NOT increase nr_free and NR_FREE_PAGES when we free a page into MIGRATE_ISOLATE.
It adds a little overhead when freeing higher-order pages, but I think it's not a big deal.
The bigger problem is the duplicated code for handling only MIGRATE_ISOLATE freed pages.

The second approach, which was suggested by KOSAKI, is what you mentioned.
But the concern about the second approach is how to make sure the increments and decrements of nr_isolated_areas are matched.
I mean, how to make sure nr_isolated_areas will be zero when isolation is done.
Of course, we can investigate all of the current callers and make sure they don't make a mistake
now. But it's very error-prone if we consider future users.
So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);
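
One way to read the test-and-set idea is that the counter changes only on an actual state transition, so a caller cannot unbalance it by isolating or un-isolating the same block twice. The following is a userspace sketch under that assumption. `struct iso_zone`, `struct iso_block`, `iso_test_set()` and `iso_test_clear()` are hypothetical names, not kernel interfaces, and the locking that real code would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a zone and a pageblock. */
struct iso_zone  { int nr_isolated_areas; };
struct iso_block { bool isolated; };

/* Returns true only if this call moved the block into the isolated state. */
bool iso_test_set(struct iso_zone *z, struct iso_block *b)
{
    assert(z != NULL && b != NULL);
    if (b->isolated)
        return false;          /* already isolated: counter untouched */
    b->isolated = true;
    z->nr_isolated_areas++;
    return true;
}

/* Returns true only if this call moved the block out of the isolated state. */
bool iso_test_clear(struct iso_zone *z, struct iso_block *b)
{
    assert(z != NULL && b != NULL);
    if (!b->isolated)
        return false;          /* not isolated: counter untouched */
    b->isolated = false;
    z->nr_isolated_areas--;
    return true;
}
```

Because the increment and decrement are tied to the block's own state, the counter returns to zero once every isolated block has been cleared, regardless of how many redundant calls a caller makes.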

IMHO, the ideal solution is that we remove the MIGRATE_ISOLATE type from buddy totally.
For that, there is no problem isolating already freed pages in the buddy allocator, but the concern is how to handle
pages freed later by do_migrate_range in memory_hotplug.c.
We can create a custom putback_lru_pages:

put_page_hotplug(page)
{
struct zone *zone = page_zone(page);
int migratetype = get_pageblock_migratetype(page);

VM_BUG_ON(migratetype != MIGRATE_ISOLATE);
__page_cache_release(page);
free_one_page(zone, page, 0, MIGRATE_ISOLATE);
}

putback_lru_pages_hotplug(&source)
{
foreach page from source
put_page_hotplug(page)
}

do_migrate_range()
{
migrate_pages(&source);
putback_lru_pages_hotplug(&source);
}

I hope this summary helps you, Kame. If I missed something, please let me know.

[1] http://lkml.org/lkml/2012/6/14/74

Thanks.

--
Kind regards,
Minchan Kim


minchan at kernel

Jun 21, 2012, 6:20 PM

Post #17 of 30 (594 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

Hi Aaditya,

On 06/21/2012 08:02 PM, Aaditya Kumar wrote:

> On Thu, Jun 21, 2012 at 10:25 AM, Minchan Kim <minchan [at] kernel> wrote:
>> On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:
>>
>>> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>>>
>>>>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>>>>> we almost always can avoid zone->lock. Just idea.
>>>>>>
>>>>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>>>>> Because we have to tweak in page free path for pages which are going to free later after we
>>>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>>>
>>>>> I mean,
>>>>>
>>>>> if (nr_isolate_pageblock != 0)
>>>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>>>
>>>>> return __zone_watermark_ok(z, alloc_order, mark,
>>>>> classzone_idx, alloc_flags, free_pages);
>>>>>
>>>>>
>>>>> I don't think this logic affect your race. zone_watermark_ok() is already
>>>>> racy. then new little race is no big matter.
>>>>
>>>>
>>>> It seems my explanation wasn't enough. :(
>>>> I already understand your intention but we can't make nr_isolate_pageblock.
>>>> Because we should count two type of free pages.
>>>
>>> I mean, move_freepages_block increment number of page *block*, not pages.
>>> number of free *pages* are counted by zone_watermark_ok_safe().
>>>
>>>
>>>> 1. Already freed page so they are already in buddy list.
>>>> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>>>
>>>> 2. Will be FREEed page by do_migrate_range.
>>>> It's a _PROBLEM_. For it, we should tweak free path. No?
>>>
>>> No.
>>>
>>>
>>>> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
>>>> zone_watermk_ok_safe can't do his role.
>>>
>>> number of isolate pageblock don't depend on number of free pages. It's
>>> a concept of
>>> an attribute of PFN range.
>>
>>
>> It seems you mean is_migrate_isolate as a just flag, NOT nr_isolate_pageblock.
>> So do you mean this?
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 3bdcab3..7f4d19c 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -1,6 +1,7 @@
>> #ifndef __LINUX_PAGEISOLATION_H
>> #define __LINUX_PAGEISOLATION_H
>>
>> +extern bool is_migrate_isolate;
>> /*
>> * Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> * If specified range includes migrate types other than MOVABLE or CMA,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..b997cb3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
>> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
>> + if (unlikely(is_migrate_isolate)) {
>> + unsigned long flags;
>> + spin_lock_irqsave(&z->lock, flags);
>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
>> + struct free_area *area = &z->free_area[order];
>> + long count = 0;
>> + struct list_head *curr;
>> +
>> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> + count++;
>> +
>> + free_pages -= (count << order);
>> + if (free_pages < 0) {
>> + free_pages = 0;
>> + break;
>> + }
>> + }
>> + spin_unlock_irqrestore(&z->lock, flags);
>> + }
>> +#endif
>> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> free_pages);
>> }
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index c9f0477..212e526 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> return pfn_to_page(pfn + i);
>> }
>>
>> +bool is_migrate_isolate = false;
>> +
>> /*
>> * start_isolate_page_range() -- make page-allocation-type of range of pages
>> * to be MIGRATE_ISOLATE.
>> @@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
>> BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
>>
>> + is_migrate_isolate = true;
>> +
>> for (pfn = start_pfn;
>> pfn < end_pfn;
>> pfn += pageblock_nr_pages) {
>> @@ -59,6 +63,7 @@ undo:
>> pfn += pageblock_nr_pages)
>> unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
>>
>> + is_migrate_isolate = false;
>> return -EBUSY;
>> }
>>
>> @@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> continue;
>> unset_migratetype_isolate(page, migratetype);
>> }
>> +
>> + is_migrate_isolate = false;
>> +
>> return 0;
>> }
>> /*
>>
>
> Hello Minchan,
>
> Sorry for delayed response.
>
> Instead of above how about something like this:
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3bdcab3..fe9215f 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -34,4 +34,6 @@ extern int set_migratetype_isolate(struct page *page);
> extern void unset_migratetype_isolate(struct page *page, unsigned migratetype);
>
>
> +extern atomic_t is_migrate_isolated;
> +
> #endif
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ab1e714..e076fa2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1381,6 +1381,7 @@ static int get_any_page(struct page *p, unsigned
> long pfn, int flags)
> * Isolate the page, so that it doesn't get reallocated if it
> * was free.
> */
> + atomic_inc(&is_migrate_isolated);


I haven't taken a detailed look at your patch yet.

Yes. In my patch, I missed several callers.
It was just a patch to show my intention, NOT a formal patch.
But I admit I didn't consider the nesting case. brain-dead :(
Technically, another problem with this is that an atomic operation doesn't imply a memory barrier, so
we need a barrier.
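
The nesting problem can be sketched in userspace C as follows. This is an illustration only, not either patch from this thread: the `flag_*()` and `count_*()` helpers are hypothetical. The sketch uses C11 `<stdatomic.h>`, whose default operations are sequentially consistent. The kernel's atomic_inc() and atomic_dec(), as far as I understand the kernel's documented semantics, give no ordering guarantee by themselves, which is the barrier concern raised above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static bool isolate_flag;         /* models "bool is_migrate_isolate" */
static atomic_int isolate_count;  /* models "atomic_t is_migrate_isolated" */

/* Flag variant: the last writer wins, so nesting is lost. */
void flag_begin(void)  { isolate_flag = true; }
void flag_end(void)    { isolate_flag = false; }
bool flag_active(void) { return isolate_flag; }

/* Counter variant: stays nonzero until every begin has a matching end. */
void count_begin(void) { atomic_fetch_add(&isolate_count, 1); }

void count_end(void)
{
    int prev = atomic_fetch_sub(&isolate_count, 1);

    assert(prev > 0);   /* an unmatched end would be a caller bug */
    (void)prev;
}

bool count_active(void) { return atomic_load(&isolate_count) != 0; }
```

With two overlapping isolations, the first flag_end() clears the flag while one isolation is still in progress, so the watermark check would skip the isolated-page accounting. The counter stays nonzero until the last count_end().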

But my concern about this approach is as follows,
copied and pasted from my reply to Kame.

***
But the concern about the second approach is how to make sure the increments and decrements of nr_isolated_areas are matched.
I mean, how to make sure nr_isolated_areas will be zero when isolation is done.
Of course, we can investigate all of the current callers and make sure they don't make a mistake
now. But it's very error-prone if we consider future users.
So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);

IMHO, the ideal solution is that we remove the MIGRATE_ISOLATE type from buddy totally.
...
...
***

Of course, we can choose this approach as an interim solution.
What do you think about it, Fujitsu guys?


> set_migratetype_isolate(p);
> /*
> * When the target page is a free hugepage, just remove it
> @@ -1406,6 +1407,7 @@ static int get_any_page(struct page *p, unsigned
> long pfn, int flags)
> }
> unset_migratetype_isolate(p, MIGRATE_MOVABLE);
> unlock_memory_hotplug();
> + atomic_dec(&is_migrate_isolated);
> return ret;
> }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0d7e3ec..cd7805c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -892,6 +892,7 @@ static int __ref offline_pages(unsigned long start_pfn,
> nr_pages = end_pfn - start_pfn;
>
> /* set above range as isolated */
> + atomic_inc(&is_migrate_isolated);
> ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> if (ret)
> goto out;
> @@ -958,6 +959,7 @@ repeat:
> offline_isolated_pages(start_pfn, end_pfn);
> /* reset pagetype flags and makes migrate type to be MOVABLE */
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> + atomic_dec(&is_migrate_isolated);
> /* removal success */
> zone->present_pages -= offlined_pages;
> zone->zone_pgdat->node_present_pages -= offlined_pages;
> @@ -986,6 +988,7 @@ failed_removal:
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>
> out:
> + atomic_dec(&is_migrate_isolated);
> unlock_memory_hotplug();
> return ret;
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..f549361 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1632,6 +1632,28 @@ bool zone_watermark_ok_safe(struct zone *z, int
> order, unsigned long mark,
> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
> + if (unlikely(atomic_read(&is_migrate_isolated))) {
> + unsigned long flags;
> + spin_lock_irqsave(&z->lock, flags);
> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> + struct free_area *area = &z->free_area[order];
> + long count = 0;
> + struct list_head *curr;
> +
> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> + count++;
> +
> + free_pages -= (count << order);
> + if (free_pages < 0) {
> + free_pages = 0;
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&z->lock, flags);
> + }
> +#endif
> +
> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> free_pages);
> }
> @@ -5785,6 +5807,7 @@ int alloc_contig_range(unsigned long start,
> unsigned long end,
> * put back to page allocator so that buddy can use them.
> */
>
> + atomic_inc(&is_migrate_isolated);
> ret = start_isolate_page_range(pfn_max_align_down(start),
> pfn_max_align_up(end), migratetype);
> if (ret)
> @@ -5854,6 +5877,7 @@ int alloc_contig_range(unsigned long start,
> unsigned long end,
> done:
> undo_isolate_page_range(pfn_max_align_down(start),
> pfn_max_align_up(end), migratetype);
> + atomic_dec(&is_migrate_isolated);
> return ret;
> }
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c9f0477..e8eb241 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> return pfn_to_page(pfn + i);
> }
>
> +atomic_t is_migrate_isolated;
> +
> /*
> * start_isolate_page_range() -- make page-allocation-type of range of pages
> * to be MIGRATE_ISOLATE.
>
>
>> It is still racy as you already mentioned and I don't think it's trivial.
>> Direct reclaim can't wake up kswapd forever by current fragile zone->all_unreclaimable.
>> So it's a livelock.
>> Then, do you want to fix this problem by your patch[1]?
>>
>> It could solve the livelock by OOM kill if we apply your patch[1] but still doesn't wake up
>> kswapd although it's not critical. Okay. Then, please write down this problem in detail
>> in your patch's changelog and resend, please.
>>
>> [1] http://lkml.org/lkml/2012/6/14/74
>>
>> --
>> Kind regards,
>> Minchan Kim



--
Kind regards,
Minchan Kim


minchan at kernel

Jun 21, 2012, 6:20 PM

Post #18 of 30 (598 views)
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

Hi Aaditya,

On 06/21/2012 08:02 PM, Aaditya Kumar wrote:

> On Thu, Jun 21, 2012 at 10:25 AM, Minchan Kim <minchan [at] kernel> wrote:
>> On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:
>>
>>> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>>>
>>>>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>>>>> we almost always can avoid zone->lock. Just idea.
>>>>>>
>>>>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>>>>> Because we have to tweak in page free path for pages which are going to free later after we
>>>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>>>
>>>>> I mean,
>>>>>
>>>>> if (nr_isolate_pageblock != 0)
>>>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>>>
>>>>> return __zone_watermark_ok(z, alloc_order, mark,
>>>>> classzone_idx, alloc_flags, free_pages);
>>>>>
>>>>>
>>>>> I don't think this logic affect your race. zone_watermark_ok() is already
>>>>> racy. then new little race is no big matter.
>>>>
>>>>
>>>> It seems my explanation wasn't enough. :(
>>>> I already understand your intention but we can't make nr_isolate_pageblock.
>>>> Because we should count two type of free pages.
>>>
>>> I mean, move_freepages_block increment number of page *block*, not pages.
>>> number of free *pages* are counted by zone_watermark_ok_safe().
>>>
>>>
>>>> 1. Already freed page so they are already in buddy list.
>>>> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>>>
>>>> 2. Will be FREEed page by do_migrate_range.
>>>> It's a _PROBLEM_. For it, we should tweak free path. No?
>>>
>>> No.
>>>
>>>
>>>> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
>>>> zone_watermk_ok_safe can't do his role.
>>>
>>> number of isolate pageblock don't depend on number of free pages. It's
>>> a concept of
>>> an attribute of PFN range.
>>
>>
>> It seems you mean is_migrate_isolate as a just flag, NOT nr_isolate_pageblock.
>> So do you mean this?
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 3bdcab3..7f4d19c 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -1,6 +1,7 @@
>> #ifndef __LINUX_PAGEISOLATION_H
>> #define __LINUX_PAGEISOLATION_H
>>
>> +extern bool is_migrate_isolate;
>> /*
>> * Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> * If specified range includes migrate types other than MOVABLE or CMA,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d2a515d..b997cb3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
>> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
>> + if (unlikely(is_migrate_isolate)) {
>> + unsigned long flags;
>> + spin_lock_irqsave(&z->lock, flags);
>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
>> + struct free_area *area = &z->free_area[order];
>> + long count = 0;
>> + struct list_head *curr;
>> +
>> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> + count++;
>> +
>> + free_pages -= (count << order);
>> + if (free_pages < 0) {
>> + free_pages = 0;
>> + break;
>> + }
>> + }
>> + spin_unlock_irqrestore(&z->lock, flags);
>> + }
>> +#endif
>> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> free_pages);
>> }
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index c9f0477..212e526 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> return pfn_to_page(pfn + i);
>> }
>>
>> +bool is_migrate_isolate = false;
>> +
>> /*
>> * start_isolate_page_range() -- make page-allocation-type of range of pages
>> * to be MIGRATE_ISOLATE.
>> @@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
>> BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
>>
>> + is_migrate_isolate = true;
>> +
>> for (pfn = start_pfn;
>> pfn < end_pfn;
>> pfn += pageblock_nr_pages) {
>> @@ -59,6 +63,7 @@ undo:
>> pfn += pageblock_nr_pages)
>> unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
>>
>> + is_migrate_isolate = false;
>> return -EBUSY;
>> }
>>
>> @@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> continue;
>> unset_migratetype_isolate(page, migratetype);
>> }
>> +
>> + is_migrate_isolate = false;
>> +
>> return 0;
>> }
>> /*
>>
>
> Hello Minchan,
>
> Sorry for delayed response.
>
> Instead of above how about something like this:
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 3bdcab3..fe9215f 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -34,4 +34,6 @@ extern int set_migratetype_isolate(struct page *page);
> extern void unset_migratetype_isolate(struct page *page, unsigned migratetype);
>
>
> +extern atomic_t is_migrate_isolated;
> +
> #endif
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ab1e714..e076fa2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1381,6 +1381,7 @@ static int get_any_page(struct page *p, unsigned
> long pfn, int flags)
> * Isolate the page, so that it doesn't get reallocated if it
> * was free.
> */
> + atomic_inc(&is_migrate_isolated);


I haven't taken a detailed look at your patch yet.

Yes. In my patch, I missed several callers.
It was just a patch to show my intention, NOT a formal patch.
But I admit I didn't consider the nesting case. brain-dead :(
Technically, another problem with this is that an atomic operation doesn't imply a memory barrier, so
we need a barrier.

But my concern about this approach is as follows,
copied and pasted from my reply to Kame.

***
But the concern about the second approach is how to make sure the increments and decrements of nr_isolated_areas are matched.
I mean, how to make sure nr_isolated_areas will be zero when isolation is done.
Of course, we can investigate all of the current callers and make sure they don't make a mistake
now. But it's very error-prone if we consider future users.
So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);

IMHO, the ideal solution is that we remove the MIGRATE_ISOLATE type from buddy totally.
...
...
***

Of course, we can choose this approach as an interim solution.
What do you think about it, Fujitsu guys?


> set_migratetype_isolate(p);
> /*
> * When the target page is a free hugepage, just remove it
> @@ -1406,6 +1407,7 @@ static int get_any_page(struct page *p, unsigned
> long pfn, int flags)
> }
> unset_migratetype_isolate(p, MIGRATE_MOVABLE);
> unlock_memory_hotplug();
> + atomic_dec(&is_migrate_isolated);
> return ret;
> }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0d7e3ec..cd7805c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -892,6 +892,7 @@ static int __ref offline_pages(unsigned long start_pfn,
> nr_pages = end_pfn - start_pfn;
>
> /* set above range as isolated */
> + atomic_inc(&is_migrate_isolated);
> ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> if (ret)
> goto out;
> @@ -958,6 +959,7 @@ repeat:
> offline_isolated_pages(start_pfn, end_pfn);
> /* reset pagetype flags and makes migrate type to be MOVABLE */
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> + atomic_dec(&is_migrate_isolated);
> /* removal success */
> zone->present_pages -= offlined_pages;
> zone->zone_pgdat->node_present_pages -= offlined_pages;
> @@ -986,6 +988,7 @@ failed_removal:
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>
> out:
> + atomic_dec(&is_migrate_isolated);
> unlock_memory_hotplug();
> return ret;
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..f549361 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1632,6 +1632,28 @@ bool zone_watermark_ok_safe(struct zone *z, int
> order, unsigned long mark,
> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
> + if (unlikely(atomic_read(&is_migrate_isolated))) {
> + unsigned long flags;
> + spin_lock_irqsave(&z->lock, flags);
> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> + struct free_area *area = &z->free_area[order];
> + long count = 0;
> + struct list_head *curr;
> +
> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> + count++;
> +
> + free_pages -= (count << order);
> + if (free_pages < 0) {
> + free_pages = 0;
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&z->lock, flags);
> + }
> +#endif
> +
> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> free_pages);
> }
> @@ -5785,6 +5807,7 @@ int alloc_contig_range(unsigned long start,
> unsigned long end,
> * put back to page allocator so that buddy can use them.
> */
>
> + atomic_inc(&is_migrate_isolated);
> ret = start_isolate_page_range(pfn_max_align_down(start),
> pfn_max_align_up(end), migratetype);
> if (ret)
> @@ -5854,6 +5877,7 @@ int alloc_contig_range(unsigned long start,
> unsigned long end,
> done:
> undo_isolate_page_range(pfn_max_align_down(start),
> pfn_max_align_up(end), migratetype);
> + atomic_dec(&is_migrate_isolated);
> return ret;
> }
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c9f0477..e8eb241 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> return pfn_to_page(pfn + i);
> }
>
> +atomic_t is_migrate_isolated;
> +
> /*
> * start_isolate_page_range() -- make page-allocation-type of range of pages
> * to be MIGRATE_ISOLATE.
>
>
>> It is still racy as you already mentioned and I don't think it's trivial.
>> Direct reclaim can't wake up kswapd forever by current fragile zone->all_unreclaimable.
>> So it's a livelock.
>> Then, do you want to fix this problem by your patch[1]?
>>
>> It could solve the livelock by OOM kill if we apply your patch[1] but still doesn't wake up
>> kswapd although it's not critical. Okay. Then, please write down this problem in detail
>> in your patch's changelog and resend, please.
>>
>> [1] http://lkml.org/lkml/2012/6/14/74
>>
>> --
>> Kind regards,
>> Minchan Kim



--
Kind regards,
Minchan Kim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


aaditya.kumar.30 at gmail

Jun 21, 2012, 7:08 PM

Post #19 of 30 (602 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On Fri, Jun 22, 2012 at 6:50 AM, Minchan Kim <minchan [at] kernel> wrote:
> Hi Aaditya,
>
> On 06/21/2012 08:02 PM, Aaditya Kumar wrote:
>
>> On Thu, Jun 21, 2012 at 10:25 AM, Minchan Kim <minchan [at] kernel> wrote:
>>> On 06/21/2012 11:45 AM, KOSAKI Motohiro wrote:
>>>
>>>> On Wed, Jun 20, 2012 at 9:55 PM, Minchan Kim <minchan [at] kernel> wrote:
>>>>> On 06/21/2012 10:39 AM, KOSAKI Motohiro wrote:
>>>>>
>>>>>>>> number of isolate page block is almost always 0. then if we have such counter,
>>>>>>>> we almost always can avoid zone->lock. Just idea.
>>>>>>>
>>>>>>> Yeb. I thought about it but unfortunately we can't have a counter for MIGRATE_ISOLATE.
>>>>>>> Because we have to tweak in page free path for pages which are going to free later after we
>>>>>>> mark pageblock type to MIGRATE_ISOLATE.
>>>>>>
>>>>>> I mean,
>>>>>>
>>>>>> if (nr_isolate_pageblock != 0)
>>>>>> free_pages -= nr_isolated_free_pages(); // your counting logic
>>>>>>
>>>>>> return __zone_watermark_ok(z, alloc_order, mark,
>>>>>> classzone_idx, alloc_flags, free_pages);
>>>>>>
>>>>>>
>>>>>> I don't think this logic affect your race. zone_watermark_ok() is already
>>>>>> racy. then new little race is no big matter.
>>>>>
>>>>>
>>>>> It seems my explanation wasn't enough. :(
>>>>> I already understand your intention but we can't make nr_isolate_pageblock.
>>>>> Because we should count two type of free pages.
>>>>
>>>> I mean, move_freepages_block increment number of page *block*, not pages.
>>>> number of free *pages* are counted by zone_watermark_ok_safe().
>>>>
>>>>
>>>>> 1. Already freed page so they are already in buddy list.
>>>>> Of course, we can count it with return value of move_freepages_block(zone, page, MIGRATE_ISOLATE) easily.
>>>>>
>>>>> 2. Will be FREEed page by do_migrate_range.
>>>>> It's a _PROBLEM_. For it, we should tweak free path. No?
>>>>
>>>> No.
>>>>
>>>>
>>>>> If All of pages are PageLRU when hot-plug happens(ie, 2), nr_isolate_pagblock is zero and
>>>>> zone_watermk_ok_safe can't do his role.
>>>>
>>>> number of isolate pageblock don't depend on number of free pages. It's
>>>> a concept of
>>>> an attribute of PFN range.
>>>
>>>
>>> It seems you mean is_migrate_isolate as a just flag, NOT nr_isolate_pageblock.
>>> So do you mean this?
>>>
>>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>>> index 3bdcab3..7f4d19c 100644
>>> --- a/include/linux/page-isolation.h
>>> +++ b/include/linux/page-isolation.h
>>> @@ -1,6 +1,7 @@
>>> #ifndef __LINUX_PAGEISOLATION_H
>>> #define __LINUX_PAGEISOLATION_H
>>>
>>> +extern bool is_migrate_isolate;
>>> /*
>>> * Changes migrate type in [.start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>>> * If specified range includes migrate types other than MOVABLE or CMA,
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index d2a515d..b997cb3 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1756,6 +1756,27 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long ma
>>> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>>> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>>
>>> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
>>> + if (unlikely(is_migrate_isolate)) {
>>> + unsigned long flags;
>>> + spin_lock_irqsave(&z->lock, flags);
>>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
>>> + struct free_area *area = &z->free_area[order];
>>> + long count = 0;
>>> + struct list_head *curr;
>>> +
>>> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>>> + count++;
>>> +
>>> + free_pages -= (count << order);
>>> + if (free_pages < 0) {
>>> + free_pages = 0;
>>> + break;
>>> + }
>>> + }
>>> + spin_unlock_irqrestore(&z->lock, flags);
>>> + }
>>> +#endif
>>> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>>> free_pages);
>>> }
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index c9f0477..212e526 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> return pfn_to_page(pfn + i);
>>> }
>>>
>>> +bool is_migrate_isolate = false;
>>> +
>>> /*
>>> * start_isolate_page_range() -- make page-allocation-type of range of pages
>>> * to be MIGRATE_ISOLATE.
>>> @@ -43,6 +45,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
>>> BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
>>>
>>> + is_migrate_isolate = true;
>>> +
>>> for (pfn = start_pfn;
>>> pfn < end_pfn;
>>> pfn += pageblock_nr_pages) {
>>> @@ -59,6 +63,7 @@ undo:
>>> pfn += pageblock_nr_pages)
>>> unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
>>>
>>> + is_migrate_isolate = false;
>>> return -EBUSY;
>>> }
>>>
>>> @@ -80,6 +85,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>> continue;
>>> unset_migratetype_isolate(page, migratetype);
>>> }
>>> +
>>> + is_migrate_isolate = false;
>>> +
>>> return 0;
>>> }
>>> /*
>>>
>>
>> Hello Minchan,
>>
>> Sorry for delayed response.
>>
>> Instead of above how about something like this:
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 3bdcab3..fe9215f 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -34,4 +34,6 @@ extern int set_migratetype_isolate(struct page *page);
>> extern void unset_migratetype_isolate(struct page *page, unsigned migratetype);
>>
>>
>> +extern atomic_t is_migrate_isolated;
>
>> +
>
>> #endif
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index ab1e714..e076fa2 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1381,6 +1381,7 @@ static int get_any_page(struct page *p, unsigned
>> long pfn, int flags)
>> * Isolate the page, so that it doesn't get reallocated if it
>> * was free.
>> */
>> + atomic_inc(&is_migrate_isolated);
>
>
> I didn't take a detail look in your patch yet.

Hi Minchan,

I think kamezawa-san's approach (copied below) is equivalent to, or
rather better than, mine, and I agree with it.
So please ignore my previous patch.

(From kamezawa-san's previous post:)

***
As you have shown, it does not seem difficult to count free pages
under MIGRATE_ISOLATE.
And we can tell whether the zone contains a MIGRATE_ISOLATE area by a simple check,
for example:
==
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
move_freepages_block(zone, page, MIGRATE_ISOLATE);
zone->nr_isolated_areas++;
==

Then, the solution will be adding a function like the following:
==
u64 zone_nr_free_pages(struct zone *zone) {
	unsigned long free_pages, isolated;

	free_pages = zone_page_state(zone, NR_FREE_PAGES);
	if (unlikely(zone->nr_isolated_areas)) {
		isolated = count_migrate_isolated_pages(zone);
		free_pages -= isolated;
	}
	return free_pages;
}
==

***
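
The accounting idea quoted above can be sketched in user space as follows. This is only a toy model under stated assumptions: struct zone_model and its fields are hypothetical and are not the kernel's struct zone, and the clamp to zero is an addition that guards against the isolated count momentarily exceeding the free count.

```c
#include <assert.h>

/* Hypothetical, simplified model of a zone; not the kernel's struct zone. */
struct zone_model {
	unsigned long nr_free_pages;     /* what NR_FREE_PAGES would report */
	unsigned long nr_isolated_areas; /* pageblocks marked MIGRATE_ISOLATE */
	unsigned long nr_isolated_free;  /* free pages sitting on isolate lists */
};

/* Free pages that can really be allocated, i.e. excluding isolated ones. */
static unsigned long zone_nr_free_pages_model(const struct zone_model *z)
{
	unsigned long free_pages;

	assert(z);
	free_pages = z->nr_free_pages;

	/* Fast path: with no isolated area there is nothing to subtract. */
	if (z->nr_isolated_areas) {
		if (z->nr_isolated_free >= free_pages)
			return 0; /* clamp instead of wrapping around */
		free_pages -= z->nr_isolated_free;
	}
	return free_pages;
}
```

The point of the nr_isolated_areas check is that isolation is rare, so the common case pays only for one comparison.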

> Yes. In my patch, I missed several caller.
> It was just a patch for showing my intention, NOT formal patch.
> But I admit I didn't consider nesting case. brain-dead :(
> Technically other problem about this is atomic doesn't imply memory barrier so
> we need barrier.
>
> But the concern about this approach is following as
> Copy/Paste from my reply of Kame.
>
> ***
> But the concern about second approach is how to make sure matched count increase/decrease of nr_isolated_areas.
> I mean how to make sure nr_isolated_areas would be zero when isolation is done.
> Of course, we can investigate all of current caller and make sure they don't make mistake
> now. But it's very error-prone if we consider future's user.
> So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>
> IMHO, ideal solution is that we remove MIGRATE_ISOLATE type totally in buddy.
> ...
> ...
> ***
>
> Of course, We can choose this approach as interim.
> What do you think about it, Fujitsu guys?
>
>
>> set_migratetype_isolate(p);
>> /*
>> * When the target page is a free hugepage, just remove it
>> @@ -1406,6 +1407,7 @@ static int get_any_page(struct page *p, unsigned
>> long pfn, int flags)
>> }
>> unset_migratetype_isolate(p, MIGRATE_MOVABLE);
>> unlock_memory_hotplug();
>> + atomic_dec(&is_migrate_isolated);
>> return ret;
>> }
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 0d7e3ec..cd7805c 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -892,6 +892,7 @@ static int __ref offline_pages(unsigned long start_pfn,
>> nr_pages = end_pfn - start_pfn;
>>
>> /* set above range as isolated */
>> + atomic_inc(&is_migrate_isolated);
>> ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>> if (ret)
>> goto out;
>> @@ -958,6 +959,7 @@ repeat:
>> offline_isolated_pages(start_pfn, end_pfn);
>> /* reset pagetype flags and makes migrate type to be MOVABLE */
>> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>> + atomic_dec(&is_migrate_isolated);
>> /* removal success */
>> zone->present_pages -= offlined_pages;
>> zone->zone_pgdat->node_present_pages -= offlined_pages;
>> @@ -986,6 +988,7 @@ failed_removal:
>> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>
>> out:
>> + atomic_dec(&is_migrate_isolated);
>> unlock_memory_hotplug();
>> return ret;
>> }
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4403009..f549361 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1632,6 +1632,28 @@ bool zone_watermark_ok_safe(struct zone *z, int
>> order, unsigned long mark,
>> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
>> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>>
>> +#if defined CONFIG_CMA || CONFIG_MEMORY_HOTPLUG
>> + if (unlikely(atomic_read(&is_migrate_isolated))) {
>> + unsigned long flags;
>> + spin_lock_irqsave(&z->lock, flags);
>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
>> + struct free_area *area = &z->free_area[order];
>> + long count = 0;
>> + struct list_head *curr;
>> +
>> + list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
>> + count++;
>> +
>> + free_pages -= (count << order);
>> + if (free_pages < 0) {
>> + free_pages = 0;
>> + break;
>> + }
>> + }
>> + spin_unlock_irqrestore(&z->lock, flags);
>> + }
>> +#endif
>> +
>> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
>> free_pages);
>> }
>> @@ -5785,6 +5807,7 @@ int alloc_contig_range(unsigned long start,
>> unsigned long end,
>> * put back to page allocator so that buddy can use them.
>> */
>>
>> + atomic_inc(&is_migrate_isolated);
>> ret = start_isolate_page_range(pfn_max_align_down(start),
>> pfn_max_align_up(end), migratetype);
>> if (ret)
>> @@ -5854,6 +5877,7 @@ int alloc_contig_range(unsigned long start,
>> unsigned long end,
>> done:
>> undo_isolate_page_range(pfn_max_align_down(start),
>> pfn_max_align_up(end), migratetype);
>> + atomic_dec(&is_migrate_isolated);
>> return ret;
>> }
>>
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index c9f0477..e8eb241 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -19,6 +19,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>> return pfn_to_page(pfn + i);
>> }
>>
>> +atomic_t is_migrate_isolated;
>> +
>> /*
>> * start_isolate_page_range() -- make page-allocation-type of range of pages
>> * to be MIGRATE_ISOLATE.
>>
>>
>>> It is still racy as you already mentioned and I don't think it's trivial.
>>> Direct reclaim can't wake up kswapd forever by current fragile zone->all_unreclaimable.
>>> So it's a livelock.
>>> Then, do you want to fix this problem by your patch[1]?
>>>
>>> It could solve the livelock by OOM kill if we apply your patch[1] but still doesn't wake up
>>> kswapd although it's not critical. Okay. Then, please write down this problem in detail
>>> in your patch's changelog and resend, please.
>>>
>>> [1] http://lkml.org/lkml/2012/6/14/74
>>>
>>> --
>>> Kind regards,
>>> Minchan Kim
>
>
>
> --
> Kind regards,
> Minchan Kim


minchan at kernel

Jun 21, 2012, 11:45 PM

Post #20 of 30 (593 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On 06/22/2012 10:05 AM, Minchan Kim wrote:

> Second approach which is suggested by KOSAKI is what you mentioned.
> But the concern about second approach is how to make sure matched count increase/decrease of nr_isolated_areas.
> I mean how to make sure nr_isolated_areas would be zero when isolation is done.
> Of course, we can investigate all of current caller and make sure they don't make mistake
> now. But it's very error-prone if we consider future's user.
> So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);


Here is an implementation of the above approach.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bf3404e..3e9a9e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -474,6 +474,11 @@ struct zone {
* rarely used fields:
*/
const char *name;
+ /*
+ * the number of MIGRATE_ISOLATE pageblock
+ * We need this for accurate free page counting.
+ */
+ atomic_t nr_migrate_isolate;
} ____cacheline_internodealigned_in_smp;

typedef enum {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c29b1c..6cb1f9f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -219,6 +219,11 @@ EXPORT_SYMBOL(nr_online_nodes);

int page_group_by_mobility_disabled __read_mostly;

+/*
+ * NOTE:
+ * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly.
+ * Instead, use {un}set_pageblock_isolate.
+ */
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled))
@@ -1622,6 +1627,28 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
zone_page_state(z, NR_FREE_PAGES));
}

+unsigned long migrate_isolate_pages(struct zone *zone)
+{
+ unsigned long nr_pages = 0;
+
+ if (unlikely(atomic_read(&zone->nr_migrate_isolate))) {
+ unsigned long flags;
+ int order;
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct free_area *area = &zone->free_area[order];
+ long count = 0;
+ struct list_head *curr;
+
+ list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
+ count++;
+ nr_pages += (count << order);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+ return nr_pages;
+}
+
bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags)
{
@@ -1630,6 +1657,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

+ /*
+ * If the zone has MIGRATE_ISOLATE type free page,
+ * we should consider it, too. Otherwise, kswapd can sleep forever.
+ */
+ free_pages -= migrate_isolate_pages(z);
+ if (free_pages < 0)
+ free_pages = 0;
+
return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
free_pages);
}
@@ -4408,6 +4443,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
lruvec_init(&zone->lruvec, zone);
zap_zone_vm_stats(zone);
zone->flags = 0;
+ atomic_set(&zone->nr_migrate_isolate, 0);
if (!size)
continue;

@@ -5555,6 +5591,45 @@ bool is_pageblock_removable_nolock(struct page *page)
return __count_immobile_pages(zone, page, 0);
}

+static void set_pageblock_isolate(struct zone *zone, struct page *page)
+{
+ int old_migratetype;
+ assert_spin_locked(&zone->lock);
+
+ if (unlikely(page_group_by_mobility_disabled)) {
+ set_pageblock_flags_group(page, MIGRATE_UNMOVABLE,
+ PB_migrate, PB_migrate_end);
+ return;
+ }
+
+ old_migratetype = get_pageblock_migratetype(page);
+ set_pageblock_flags_group(page, MIGRATE_ISOLATE,
+ PB_migrate, PB_migrate_end);
+
+ if (old_migratetype != MIGRATE_ISOLATE)
+ atomic_inc(&zone->nr_migrate_isolate);
+}
+
+static void unset_pageblock_isolate(struct zone *zone, struct page *page,
+ unsigned long migratetype)
+{
+ assert_spin_locked(&zone->lock);
+
+ if (unlikely(page_group_by_mobility_disabled)) {
+ set_pageblock_flags_group(page, migratetype,
+ PB_migrate, PB_migrate_end);
+ return;
+ }
+
+ BUG_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
+ BUG_ON(migratetype == MIGRATE_ISOLATE);
+
+ set_pageblock_flags_group(page, migratetype,
+ PB_migrate, PB_migrate_end);
+ atomic_dec(&zone->nr_migrate_isolate);
+ BUG_ON(atomic_read(&zone->nr_migrate_isolate) < 0);
+}
+
int set_migratetype_isolate(struct page *page)
{
struct zone *zone;
@@ -5601,7 +5676,7 @@ int set_migratetype_isolate(struct page *page)

out:
if (!ret) {
- set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ set_pageblock_isolate(zone, page);
move_freepages_block(zone, page, MIGRATE_ISOLATE);
}

@@ -5619,8 +5694,8 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
- set_pageblock_migratetype(page, migratetype);
move_freepages_block(zone, page, migratetype);
+ unset_pageblock_isolate(zone, page, migratetype);
out:
spin_unlock_irqrestore(&zone->lock, flags);
}
--
1.7.9.5
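
For readers who want to check the arithmetic of the counting loop, here is a user-space sketch. It assumes the per-order free lists are reduced to plain block counts; MAX_ORDER_MODEL and both function names are hypothetical. Each free block on the order-N list covers 2^N pages, so the page total is the sum of count << order.

```c
#include <assert.h>

#define MAX_ORDER_MODEL 11 /* assumed value, mirroring a common MAX_ORDER */

/* nr_blocks[order] = number of free 2^order-page blocks on the isolate list */
static unsigned long isolated_free_pages(const unsigned long nr_blocks[MAX_ORDER_MODEL])
{
	unsigned long nr_pages = 0;
	int order;

	assert(nr_blocks);
	for (order = 0; order < MAX_ORDER_MODEL; order++)
		nr_pages += nr_blocks[order] << order;
	return nr_pages;
}

/* Free count to hand to the watermark check, clamped at zero. */
static long usable_free_pages(long free_pages,
			      const unsigned long nr_blocks[MAX_ORDER_MODEL])
{
	free_pages -= (long)isolated_free_pages(nr_blocks);
	return free_pages < 0 ? 0 : free_pages;
}
```

Using a local order variable, as migrate_isolate_pages() does, also keeps the order argument of the watermark check untouched.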


--
Kind regards,
Minchan Kim


kosaki.motohiro at gmail

Jun 22, 2012, 12:22 AM

Post #21 of 30 (597 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

> Let me summary again.
>
> The problem:
>
> when hotplug offlining happens on zone A, it starts to freed page as MIGRATE_ISOLATE type in buddy.
> (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them)
> When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd.
> Kswapd checks watermark, then go sleep BECAUSE current zone_watermark_ok_safe doesn't consider
> MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's help.
> The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever
> like below.
>
> __alloc_pages_slowpath
> restart:
> wake_all_kswapd
> rebalance:
> __alloc_pages_direct_reclaim
> do_try_to_free_pages
> if global_reclaim && !all_unreclaimable
> return 1; /* It means we did did_some_progress */
> skip __alloc_pages_may_oom
> should_alloc_retry
> goto rebalance;
>
> If we apply KOSAKI's patch[1] which doesn't depends on kswapd about setting zone->all_unreclaimable,
> we can solve this problem by killing some task. But it doesn't wake up kswapd, still.
> It could be a problem still if other subsystem needs GFP_ATOMIC request.
> So kswapd should consider MIGRATE_ISOLATE when it calculate free pages before going sleep.

I agree. And I believe we should remove the rebalance label so that an
allocation retry always wakes up kswapd, because wake_all_kswapd is
unreliable: it has no guarantee of successfully waking up kswapd. So this
micro-optimization is NOT an optimization, just a source of trouble. Our
memory reclaim logic has a lot of races by design, so no reclaim code
should assume that someone else works fine.
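
A toy user-space simulation of the difference, under loud assumptions: struct reclaim_sim, wakeups_needed and max_retries are all hypothetical, and the real slow path has no retry bound. The model only assumes that a wakeup can be lost, so an allocation may need several of them before kswapd makes enough progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the allocation succeeds once kswapd was woken enough times. */
struct reclaim_sim {
	int kswapd_wakeups; /* wakeups issued so far */
	int wakeups_needed; /* wakeups required before an allocation can succeed */
};

static void wake_all_kswapd_sim(struct reclaim_sim *s)
{
	s->kswapd_wakeups++;
}

static bool try_alloc_sim(const struct reclaim_sim *s)
{
	return s->kswapd_wakeups >= s->wakeups_needed;
}

/* Returns the attempt number that succeeded, or -1 if we gave up. */
static int alloc_slowpath_sim(struct reclaim_sim *s, int max_retries,
			      bool wake_every_retry)
{
	int attempt;

	assert(s);
	wake_all_kswapd_sim(s); /* the single wakeup done before "rebalance" */

	for (attempt = 1; attempt <= max_retries; attempt++) {
		if (try_alloc_sim(s))
			return attempt;
		if (wake_every_retry)
			wake_all_kswapd_sim(s); /* retry goes back through the wakeup */
	}
	return -1;
}
```

With wake_every_retry false the loop can only spin, which mirrors the livelock described in the quoted summary. With it true, each retry gives kswapd another chance.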



> Firstly I tried to solve this problem by this.
> https://lkml.org/lkml/2012/6/20/30
> The patch's goal was to NOT increase nr_free and NR_FREE_PAGES when we free page into MIGRATE_ISOLATED.
> But it increases little overhead in higher order free page but I think it's not a big deal.
> More problem is duplicated codes for handling only MIGRATE_ISOLATE freed page.
>
> Second approach which is suggested by KOSAKI is what you mentioned.
> But the concern about second approach is how to make sure matched count increase/decrease of nr_isolated_areas.
> I mean how to make sure nr_isolated_areas would be zero when isolation is done.
> Of course, we can investigate all of current caller and make sure they don't make mistake
> now. But it's very error-prone if we consider future's user.
> So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>
> IMHO, ideal solution is that we remove MIGRATE_ISOLATE type totally in buddy.
> For it, there is no problem to isolate already freed page in buddy allocator but the concern is how to handle
> freed page later by do_migrate_range in memory_hotplug.c.
> We can create custom putback_lru_pages
>
> put_page_hotplug(page)
> {
> int migratetype = get_pageblock_migratetype(page)
> VM_BUG_ON(migratetype != MIGRATE_ISOLATE);
> __page_cache_release(page);
> free_one_page(zone, page, 0, MIGRATE_ISOLATE);
> }
>
> putback_lru_pages_hotplug(&source)
> {
> foreach page from source
> put_page_hotplug(page)
> }
>
> do_migrate_range()
> {
> migrate_pages(&source);
> putback_lru_pages_hotplug(&source);
> }
>
> I hope this summary can help you, Kame and If I miss something, please let me know it.

I disagree with this, because memory hotplug intentionally doesn't use
stop_machine. That is because we don't stop any system service while
memory is being unplugged. That means various subsystems try to
allocate memory during page migration for memory unplug. IOW, we
shouldn't assume do_migrate_page() is the only caller.


aaditya.kumar.30 at gmail

Jun 22, 2012, 12:56 AM

Post #22 of 30 (595 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On Fri, Jun 22, 2012 at 12:52 PM, KOSAKI Motohiro
<kosaki.motohiro [at] gmail> wrote:
>> Let me summary again.
>>
>> The problem:
>>
>> when hotplug offlining happens on zone A, it starts to freed page as MIGRATE_ISOLATE type in buddy.
>> (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them)
>> When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd.
>> Kswapd checks watermark, then go sleep BECAUSE current zone_watermark_ok_safe doesn't consider
>> MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's help.
>> The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever
>> like below.
>>
>> __alloc_pages_slowpath
>> restart:
>> wake_all_kswapd
>> rebalance:
>> __alloc_pages_direct_reclaim
>> do_try_to_free_pages
>> if global_reclaim && !all_unreclaimable
>> return 1; /* It means we did did_some_progress */
>> skip __alloc_pages_may_oom
>> should_alloc_retry
>> goto rebalance;
>>
>> If we apply KOSAKI's patch[1] which doesn't depends on kswapd about setting zone->all_unreclaimable,
>> we can solve this problem by killing some task. But it doesn't wake up kswapd, still.
>> It could be a problem still if other subsystem needs GFP_ATOMIC request.
>> So kswapd should consider MIGRATE_ISOLATE when it calculate free pages before going sleep.
>
> I agree. And I believe we should remove rebalance label and alloc
> retrying should always wake up kswapd.
> because wake_all_kswapd is unreliable, it have no guarantee to success
> to wake up kswapd. then this
> micro optimization is NOT optimization. Just trouble source. Our
> memory reclaim logic has a lot of race
> by design. then any reclaim code shouldn't believe some one else works fine.
>

I think this is a better approach. Since MIGRATE_ISOLATE is really a
temporary phenomenon, it makes sense to just retry the allocation.
One issue with this approach, however, is that it does not exactly work
for PAGE_ALLOC_COSTLY_ORDER. But given the frequency of such
allocations, I think it may be an acceptable compromise to handle such
requests by OOM when many MIGRATE_ISOLATE pages are present.

What do you think?
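
For context, here is a simplified sketch of how the retry decision treats costly orders, written from my reading of should_alloc_retry() in kernels of this era. It is not the kernel function: the flag bits are hypothetical stand-ins for __GFP_NORETRY, __GFP_NOFAIL and __GFP_REPEAT, and the suspend-related check is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define COSTLY_ORDER 3 /* value of PAGE_ALLOC_COSTLY_ORDER */

/* Hypothetical flag bits standing in for the real gfp flags. */
#define F_NORETRY 0x1u
#define F_NOFAIL  0x2u
#define F_REPEAT  0x4u

static bool should_retry_sketch(unsigned int flags, unsigned int order,
				unsigned long pages_reclaimed)
{
	assert(order < 32);

	if (flags & F_NORETRY)
		return false;
	if (flags & F_NOFAIL)
		return true;

	/* Cheap orders are retried unconditionally. */
	if (order <= COSTLY_ORDER)
		return true;

	/* Costly orders retry only on request, and only for a while. */
	if ((flags & F_REPEAT) && pages_reclaimed < (1UL << order))
		return true;

	return false;
}
```

So a retry-based fix covers orders up to PAGE_ALLOC_COSTLY_ORDER, while larger requests fall out of the loop and may end up at the OOM path.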

>
>> Firstly I tried to solve this problem by this.
>> https://lkml.org/lkml/2012/6/20/30
>> The patch's goal was to NOT increase nr_free and NR_FREE_PAGES when we free page into MIGRATE_ISOLATED.
>> But it increases little overhead in higher order free page but I think it's not a big deal.
>> More problem is duplicated codes for handling only MIGRATE_ISOLATE freed page.
>>
>> Second approach which is suggested by KOSAKI is what you mentioned.
>> But the concern about second approach is how to make sure matched count increase/decrease of nr_isolated_areas.
>> I mean how to make sure nr_isolated_areas would be zero when isolation is done.
>> Of course, we can investigate all of current caller and make sure they don't make mistake
>> now. But it's very error-prone if we consider future's user.
>> So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>>
>> IMHO, ideal solution is that we remove MIGRATE_ISOLATE type totally in buddy.
>> For it, there is no problem to isolate already freed page in buddy allocator but the concern is how to handle
>> freed page later by do_migrate_range in memory_hotplug.c.
>> We can create custom putback_lru_pages
>>
>> put_page_hotplug(page)
>> {
>> int migratetype = get_pageblock_migratetype(page)
>> VM_BUG_ON(migratetype != MIGRATE_ISOLATE);
>> __page_cache_release(page);
>> free_one_page(zone, page, 0, MIGRATE_ISOLATE);
>> }
>>
>> putback_lru_pages_hotplug(&source)
>> {
>> foreach page from source
>> put_page_hotplug(page)
>> }
>>
>> do_migrate_range()
>> {
>> migrate_pages(&source);
>> putback_lru_pages_hotplug(&source);
>> }
>>
>> I hope this summary can help you, Kame and If I miss something, please let me know it.
>
> I disagree this. Because of, memory hotplug intentionally don't use
> stopmachine. It is because
> we don't stop any system service when memory is being unpluged. That's
> said various subsystem
> try to allocate memory during page migration for memory unplug. IOW,
> we shouldn't do_migrate_page()
> is only one caller.


kosaki.motohiro at gmail

Jun 22, 2012, 1:13 AM

Post #23 of 30 (595 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

On Fri, Jun 22, 2012 at 3:56 AM, Aaditya Kumar
<aaditya.kumar.30 [at] gmail> wrote:
> On Fri, Jun 22, 2012 at 12:52 PM, KOSAKI Motohiro
> <kosaki.motohiro [at] gmail> wrote:
>>> Let me summary again.
>>>
>>> The problem:
>>>
>>> when hotplug offlining happens on zone A, it starts to freed page as MIGRATE_ISOLATE type in buddy.
>>> (MIGRATE_ISOLATE is very irony type because it's apparently on buddy but we can't allocate them)
>>> When the memory shortage happens during hotplug offlining, current task starts to reclaim, then wake up kswapd.
>>> Kswapd checks watermark, then go sleep BECAUSE current zone_watermark_ok_safe doesn't consider
>>> MIGRATE_ISOLATE freed page count. Current task continue to reclaim in direct reclaim path without kswapd's help.
>>> The problem is that zone->all_unreclaimable is set by only kswapd so that current task would be looping forever
>>> like below.
>>>
>>> __alloc_pages_slowpath
>>> restart:
>>> wake_all_kswapd
>>> rebalance:
>>> __alloc_pages_direct_reclaim
>>> do_try_to_free_pages
>>> if global_reclaim && !all_unreclaimable
>>> return 1; /* It means we did did_some_progress */
>>> skip __alloc_pages_may_oom
>>> should_alloc_retry
>>> goto rebalance;
>>>
>>> If we apply KOSAKI's patch[1] which doesn't depends on kswapd about setting zone->all_unreclaimable,
>>> we can solve this problem by killing some task. But it doesn't wake up kswapd, still.
>>> It could be a problem still if other subsystem needs GFP_ATOMIC request.
>>> So kswapd should consider MIGRATE_ISOLATE when it calculate free pages before going sleep.
>>
>> I agree. And I believe we should remove rebalance label and alloc
>> retrying should always wake up kswapd.
>> because wake_all_kswapd is unreliable, it have no guarantee to success
>> to wake up kswapd. then this
>> micro optimization is NOT optimization. Just trouble source. Our
>> memory reclaim logic has a lot of race
>> by design. then any reclaim code shouldn't believe some one else works fine.
>>
>
> I think this is a better approach. Since MIGRATE_ISOLATE is really a
> temporary phenomenon, it makes sense to just retry the allocation.
> One issue with this approach, however, is that it does not exactly work
> for PAGE_ALLOC_COSTLY_ORDER. But given the
> frequency of such allocations, I think it may be an acceptable
> compromise to handle such requests by OOM when many
> MIGRATE_ISOLATE
> pages are present.
>
> What do you think?

I think we need both changes.
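
To illustrate the accounting gap discussed above, here is a minimal userspace sketch. It is not kernel code and rests on simplifying assumptions: `struct wm_zone`, `wm_ok_naive()` and `wm_ok_isolate_aware()` are hypothetical names, and the real `__zone_watermark_ok()` also accounts for lowmem reserves and per-order free counts, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct zone (illustration only). */
struct wm_zone {
	long nr_free;          /* models NR_FREE_PAGES; includes isolated free pages */
	long nr_isolated_free; /* free pages sitting on MIGRATE_ISOLATE free lists */
	long high_wmark;       /* high watermark, in pages */
};

/* Naive check: trusts the free page counter as-is. */
static bool wm_ok_naive(const struct wm_zone *z)
{
	return z->nr_free > z->high_wmark;
}

/* Isolate-aware check: discounts free pages that cannot be allocated. */
static bool wm_ok_isolate_aware(const struct wm_zone *z)
{
	long usable;

	assert(z->nr_isolated_free >= 0);
	usable = z->nr_free - z->nr_isolated_free;
	if (usable < 0)
		usable = 0; /* clamp, since the two counters may be sampled at different times */
	return usable > z->high_wmark;
}
```

With most of the free pages isolated, the naive check reports the zone as balanced (so a kswapd-like thread would go back to sleep), while the isolate-aware check reports that reclaim is still needed.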


kosaki.motohiro at gmail

Jun 22, 2012, 7:56 PM

Post #24 of 30 (586 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

(6/22/12 2:45 AM), Minchan Kim wrote:
> On 06/22/2012 10:05 AM, Minchan Kim wrote:
>
>> The second approach, suggested by KOSAKI, is what you mentioned.
>> But the concern about the second approach is how to make sure increases and decreases of nr_isolated_areas are matched.
>> I mean, how to make sure nr_isolated_areas is zero when isolation is done.
>> Of course, we can investigate all of the current callers and make sure they don't make mistakes
>> now. But it's very error-prone if we consider future users.
>> So we might need test_set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>
>
> Here is an implementation of the above approach.
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bf3404e..3e9a9e1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -474,6 +474,11 @@ struct zone {
> * rarely used fields:
> */
> const char *name;
> + /*
> + * the number of MIGRATE_ISOLATE pageblock
> + * We need this for accurate free page counting.
> + */
> + atomic_t nr_migrate_isolate;

Shouldn't this be under #ifdef CONFIG_MEMORY_HOTPLUG?


> } ____cacheline_internodealigned_in_smp;
>
> typedef enum {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2c29b1c..6cb1f9f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -219,6 +219,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>
> int page_group_by_mobility_disabled __read_mostly;
>
> +/*
> + * NOTE:
> + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly.
> + * Instead, use {un}set_pageblock_isolate.
> + */
> void set_pageblock_migratetype(struct page *page, int migratetype)
> {
> if (unlikely(page_group_by_mobility_disabled))
> @@ -1622,6 +1627,28 @@ bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> zone_page_state(z, NR_FREE_PAGES));
> }
>
> +unsigned long migrate_isolate_pages(struct zone *zone)
> +{
> +	unsigned long nr_pages = 0;
> +
> +	if (unlikely(atomic_read(&zone->nr_migrate_isolate))) {
> +		unsigned long flags;
> +		int order;
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (order = 0; order < MAX_ORDER; order++) {
> +			struct free_area *area = &zone->free_area[order];
> +			long count = 0;
> +			struct list_head *curr;
> +
> +			list_for_each(curr, &area->free_list[MIGRATE_ISOLATE])
> +				count++;
> +			nr_pages += (count << order);
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +	return nr_pages;
> +}
> +
> bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
> int classzone_idx, int alloc_flags)
> {
> @@ -1630,6 +1657,14 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
> if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
> free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
>
> +	/*
> +	 * If the zone has MIGRATE_ISOLATE type free page,
> +	 * we should consider it, too. Otherwise, kswapd can sleep forever.
> +	 */
> +	free_pages -= migrate_isolate_pages(z);
> +	if (free_pages < 0)
> +		free_pages = 0;
> +
> return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> free_pages);
> }
> @@ -4408,6 +4443,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> lruvec_init(&zone->lruvec, zone);
> zap_zone_vm_stats(zone);
> zone->flags = 0;
> + atomic_set(&zone->nr_migrate_isolate, 0);
> if (!size)
> continue;
>
> @@ -5555,6 +5591,45 @@ bool is_pageblock_removable_nolock(struct page *page)
> return __count_immobile_pages(zone, page, 0);
> }
>
> +static void set_pageblock_isolate(struct zone *zone, struct page *page)
> +{
> +	int old_migratetype;
> +	assert_spin_locked(&zone->lock);
> +
> +	if (unlikely(page_group_by_mobility_disabled)) {


We don't need this check. page_group_by_mobility_disabled is an optimization for
low-memory systems, but memory hotplug should work even when running on a low-memory system.

In other words, the current upstream code is buggy. :-)


> +		set_pageblock_flags_group(page, MIGRATE_UNMOVABLE,
> +				PB_migrate, PB_migrate_end);
> +		return;
> +	}
> +
> +	old_migratetype = get_pageblock_migratetype(page);
> +	set_pageblock_flags_group(page, MIGRATE_ISOLATE,
> +			PB_migrate, PB_migrate_end);
> +
> +	if (old_migratetype != MIGRATE_ISOLATE)
> +		atomic_inc(&zone->nr_migrate_isolate);
> +}
> +
> +static void unset_pageblock_isolate(struct zone *zone, struct page *page,
> +		unsigned long migratetype)
> +{
> +	assert_spin_locked(&zone->lock);
> +
> +	if (unlikely(page_group_by_mobility_disabled)) {
> +		set_pageblock_flags_group(page, migratetype,
> +				PB_migrate, PB_migrate_end);
> +		return;
> +	}
> +
> +	BUG_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
> +	BUG_ON(migratetype == MIGRATE_ISOLATE);
> +
> +	set_pageblock_flags_group(page, migratetype,
> +			PB_migrate, PB_migrate_end);
> +	atomic_dec(&zone->nr_migrate_isolate);
> +	BUG_ON(atomic_read(&zone->nr_migrate_isolate) < 0);
> +}
> +
> int set_migratetype_isolate(struct page *page)
> {
> struct zone *zone;
> @@ -5601,7 +5676,7 @@ int set_migratetype_isolate(struct page *page)
>
> out:
> if (!ret) {
> - set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> + set_pageblock_isolate(zone, page);
> move_freepages_block(zone, page, MIGRATE_ISOLATE);
> }
>
> @@ -5619,8 +5694,8 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> spin_lock_irqsave(&zone->lock, flags);
> if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> goto out;
> - set_pageblock_migratetype(page, migratetype);
> move_freepages_block(zone, page, migratetype);
> + unset_pageblock_isolate(zone, page, migratetype);

I don't think this order change is necessary. Why did you swap?


Other than that, looks very good to me.


> out:
> spin_unlock_irqrestore(&zone->lock, flags);
> }
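
For readers following the patch, the bookkeeping it introduces can be sketched in plain userspace C. This is only an illustration under stated assumptions: all `model_*` / `MODEL_*` names are hypothetical, per-order arrays stand in for the kernel's free lists, plain integers stand in for `atomic_t`, and no locking is modeled. It shows the two ideas from the patch: the counter is bumped only on the first transition into the isolated state, and free isolated pages are summed as (blocks on the order-N list) << N.

```c
#include <assert.h>

#define MODEL_MAX_ORDER 11 /* assumption: a typical MAX_ORDER value */

enum model_migratetype { MODEL_MOVABLE, MODEL_ISOLATE };

struct model_zone {
	int nr_migrate_isolate;                      /* number of isolated pageblocks */
	unsigned long isolate_free[MODEL_MAX_ORDER]; /* free blocks per order on the isolate list */
};

struct model_pageblock {
	enum model_migratetype type;
};

/* Mark a pageblock isolated; count it only on the first transition. */
static void model_set_isolate(struct model_zone *z, struct model_pageblock *pb)
{
	if (pb->type != MODEL_ISOLATE) {
		pb->type = MODEL_ISOLATE;
		z->nr_migrate_isolate++;
	}
}

/* Undo isolation; the asserts play the role of the patch's BUG_ON checks. */
static void model_unset_isolate(struct model_zone *z, struct model_pageblock *pb,
				enum model_migratetype type)
{
	assert(pb->type == MODEL_ISOLATE);
	assert(type != MODEL_ISOLATE);
	pb->type = type;
	z->nr_migrate_isolate--;
	assert(z->nr_migrate_isolate >= 0);
}

/* Sum free isolated pages; skip the walk when nothing is isolated. */
static unsigned long model_isolate_pages(const struct model_zone *z)
{
	unsigned long nr_pages = 0;
	int order;

	if (z->nr_migrate_isolate == 0)
		return 0;
	for (order = 0; order < MODEL_MAX_ORDER; order++)
		nr_pages += z->isolate_free[order] << order;
	return nr_pages;
}
```

Calling `model_set_isolate()` twice on the same block leaves the counter at 1, which is the property the `old_migratetype != MIGRATE_ISOLATE` test in the patch is meant to provide.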



kosaki.motohiro at gmail

Jun 22, 2012, 7:59 PM

Post #25 of 30 (591 views)
Permalink
Re: Accounting problem of MIGRATE_ISOLATED freed page [In reply to]

One more.


> +/*
> + * NOTE:
> + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly.
> + * Instead, use {un}set_pageblock_isolate.
> + */
> void set_pageblock_migratetype(struct page *page, int migratetype)
> {
> if (unlikely(page_group_by_mobility_disabled))

I don't think we need this comment. Please just add a BUG_ON.
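
As a rough illustration of that suggestion (a userspace sketch, not the actual kernel change), the generic setter can refuse the isolate type outright so that callers are forced through the dedicated helpers. The `guard_*` / `GUARD_*` names are hypothetical, and an error return stands in for the point where the kernel would hit BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>

enum guard_migratetype { GUARD_UNMOVABLE, GUARD_MOVABLE, GUARD_ISOLATE };

/* The generic setter may be used for every type except the isolate type. */
static bool guard_generic_set_allowed(enum guard_migratetype type)
{
	return type != GUARD_ISOLATE;
}

/*
 * Hypothetical model of a guarded generic setter.
 * Returns 0 on success, -1 where the kernel version would BUG_ON.
 */
static int guard_set_migratetype(enum guard_migratetype *slot,
				 enum guard_migratetype type)
{
	if (!guard_generic_set_allowed(type))
		return -1;
	*slot = type;
	return 0;
}
```

A rejected call leaves the stored type untouched, so a misuse is reported at the call site instead of silently skewing the isolation counter.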

